Abstract

We consider, as a means of making programming languages more flexible and 
powerful, a parsing algorithm in which the parser may freely modify the grammar 
while parsing. We are particularly interested in a modification of the canonical 
LR(1) parsing algorithm in which, after the reduction of certain productions, we 
examine the source sentence seen so far to determine the grammar to use to con- 
tinue parsing. A naive modification of the canonical LR(1) parsing algorithm along 
these lines cannot be guaranteed to halt; as a result, we develop a test which ex- 
amines the grammar as it changes, stopping the parse if the grammar changes in a 
way that would invalidate earlier assumptions made by the parser. With this test in 
hand, we can develop our parsing algorithm and prove that it is correct. That being 
done, we turn to earlier, related work; the idea of programming languages which 
can be extended to include new syntactic constructs has existed almost as long 
as the idea of high-level programming languages. Early efforts to construct such 
a programming language were hampered by an immature theory of formal
languages. More recent efforts to construct transformative languages either rely
on an inefficient chain of source-to-source translators or have a defect, present
in our naive parsing algorithm, in that they cannot be known to halt. The present
algorithm does not have these undesirable properties, and as such, it should prove 
a useful foundation for a new kind of programming language. 

1 Introduction 

Programming is the enterprise of fitting the infinitely subtle subjects of algorithms 
and interfaces into the rigid confines of a formal language defined by a few unyield- 
ing rules — is it any wonder that this process can be so difficult? The first step in 
this process is the selection of the language. As we go along in this enterprise, we
might find that our selected language is inadequate for the task at hand, at which
point we have three options: we can forge ahead with an imperfect language, we can
attempt to address the problematic section in a different language, or we can
jettison the language for another with its own limitations, thereby duplicating the
effort already put into writing the program in the first language. With ever larger,
more complex programs, we increasingly find that no single language is especially
well-suited; yet if we try to use multiple languages, we face significant hurdles in
integrating the languages, with rare exceptions. A fourth possibility presents
itself: we could create a new programming language that contains all of the features
we will ever need in any section of the program; aside from the fact that creating a
general programming language is a monumental effort in and of itself, the resulting
programming
language will likely be a cumbersome monster. What we seek is a language that 
is at once general enough to suffice for very large programs, while also having 
specific features for each portion of the program. 

There are a great many mature programming languages in existence, each
with its own advantages and disadvantages. None of these is the language we
seek. Ideally, we would like to be able to take an existing programming language
and, without having to duplicate the tremendous amount of effort which went into
its creation and development, not to mention our own effort in learning it, mold it
to our needs.

We do not have time to survey the major languages, but programs in these 
languages do fit a general mold: programs must be syntactically well-formed, then 
they must be semantically well-meaning, and finally, they must specify a program 
that is free from run-time errors. Moving from the source code for a program to 
a run-time executable involves three phases: syntax analysis, semantic analysis, 
and code generation. The first two analysis phases are not separated in practice, 
but are performed in concert by a parser which is generated by a parser generator. 
The parser generator takes a grammar describing the syntax of the programming 
language, in addition to the semantic value of each production in the grammar, 
from which it produces a parser. If we had the source code to the compiler, we 
could change it to suit our purposes, producing a derived language. However, 
we must be careful if we do this, for changes to the code generator could produce 
binaries that lack compatibility with existing binaries. 

The aforementioned approach is not terribly common: its most glaring problem
is that a program written in a derived language cannot be compiled by a "normal"
compiler. An alternative is to make a new compiler wholesale: one that,
rather than outputting a binary, outputs source code in an existing programming
language; such a compiler is known as a source-to-source translator. The practice
of creating source-to-source translators is much more common than the practice
of creating derived languages; two examples are the cfront compiler for C++ and a
WSDL compiler for SOAP. These two examples illustrate an interesting point: the
new language can share much with the target language, as is the case with cfront;
or, the new language can share nothing with the target language, as is the case with
a WSDL compiler.

Creating a derived language is an attractive concept because we can directly 
leverage an existing implementation of a base language, but modifying any large 
program — the compiler, in this case — in an ad-hoc manner is not exactly an easy 
task. This approach becomes decidedly less attractive if we seek to radically alter 
the language: we will likely find that the code generator is tightly coupled with 
the parser, and that the facilities for creating abstract syntax trees have no more 
generality than is necessary for the original language. Creating a source-to-source 
translator, on the other hand, is an attractive concept because we can make a lan- 
guage that departs from the target language as much or as little as we want; how- 
ever, perhaps too much information is lost in the conversion to the target language: 
data such as debugging information, higher-order typing, optimization hints, and 
details necessary for proper error handling are just a few of the things which might 
get lost. Another problem is what I refer to as the "language tower problem": say
we start with a language L, then we create a source-to-source translator from L++
to L, then we create a source-to-source translator for Aspect-L++, then we create a
source-to-source translator for Visual Aspect-L++, ad nauseam. In short, we end
up with far too many parsers.

We will take an approach somewhere between these two. We would like to develop
a base language that is general purpose enough to serve us in its unmodified
form, yet can be modified at our pleasure. In light of our consideration of derived
languages, we will create a general-purpose framework for abstract syntax trees
that both the base language and any derived language can use to capture the full
range of the semantics of a program. Also, we will not require someone wishing to
create a derived language to create an entire grammar: we will allow modifications
of the existing grammar. We also avoid most of the problems associated with
source-to-source translation, chiefly the loss of data. Consider, for example,
run-time metadata (like profiling and/or debugging data), data types, optimization
hints, and error messages; none of these are likely to be present in the binary if
source-to-source translators are used. Since we are making it easy to modify the
language, one might expect the language tower problem to be exacerbated, but this
is hardly the case, for there is only ever a single parser.

We pause to note that there must be some way of specifying the semantic ac- 
tions of a production. We can assume that these actions are specified in a pro- 
gramming language, probably the base language itself, and that the parser has an 
interpreter for that language included in its implementation. 

The code generator only understands so much of what is potentially in an ab- 
stract syntax tree. Everything else — the debugging, type, optimization, and error 
data — which gets added to the tree must be, to a large extent, ignored by the code 
generator. However, these data — we will call them extended semantics data — 
are not valueless; thus, we will allow additional analysis phases to be performed 
on the abstract syntax tree between the parsing and the code generation phases.
Here again, an interpreter embedded in the parser will be invaluable. 

How might a language like this be used? We can use it to add gross lan- 
guage features, for example object-oriented or aspect-oriented support. Or we 
could add more behind-the-scenes features, improving, for example, the optimizer.
If we know that we are using a particular library, we can give first-class syntac- 
tic support to common patterns — for example, we could support monitors, as Java 
does with the synchronized keyword. Finally, we could create a modified gram- 
mar to eliminate repetitive code, using the modifiability of the language as a sort 
of macro processor. 

We will allow the parser to modify itself during parsing. From here on, we 
will assume that a parser operates strictly left-to-right. No longer can we treat the 
syntax analysis and semantic analysis phases as entirely separate, even conceptu- 
ally, for some part of a file may define the syntax and semantics of the remainder 
of the file. 

The study of formal languages has produced many interesting classes of languages:
regular, context-free, context-sensitive, and recursively enumerable being
the best known. If $X$ is a class of languages, then the transformative $X$
languages are those languages whose strings $x$ can be decomposed as
$x = y_1 y_2 \ldots y_n$, such that $y_i$ is a substring of an element of one of the
languages in $X$, which we term the $i$th instantaneous language; further, $y_i$
specifies the instantaneous language $i + 1$.

Our goal in the present work will be to develop a method of parsing a useful
class of transformative languages. Our parser will operate much like a classical
parser, except that, as it moves over the boundary between $y_i$ and $y_{i+1}$, it will
modify itself; more precisely, it will modify its grammar, and then its parsing tables.
Since we are dealing with a self-modifying parser, we would run into problems if
the parser were to backtrack from $y_{i+1}$ to $y_i$. These are not insurmountable
problems, to be sure, but we will find a satisfactory non-backtracking method of
parsing that does not have them.

1.1 Applications of Transformative Parsing 

Let us say that we need to write a graphical program in a language much like Java 
which can access both a web service and a database; let us assume that we do not 
have any visual rapid-development tools. We must write a lot of GUI code like 
"make a window, put a layout in the window, put the following controls in the 
layout: . . . , add a toolbar to the window, add an item to the toolbar with the label 
'x', set the callback object to 'y'" etc. We must write a lot of database code like "parse
a query, bind the following variables (...), execute the query, create a cursor, ad- 
vance the cursor, get the first column, get the second column" etc. We must write 
a lot of web service code like "create a procedure call, marshal the input, call the 
procedure, demarshal the output, handle any exceptions" etc. The GUI, database,
and web service functionality is most likely handled by a library. Would that each 
library added syntax constructs for the operations it provides. We could declare 
the GUI with code like 

window{ layout{...}; toolbar{ item{ label = 'x'; action = y } } } 

The interesting thing is that the y identifier is bound to the correct lexical scope. 
We could process our query with code like 

query(select col1, col2 from t1 where col=$z -> (c1, String c2) {
}

Here, z is a variable in the scope containing the query construct, as is c1; however,
c2 is local to the block after the query.

Finally, we could process our web service with code like: 

webservice service=ws_connect{ url=http://www/shop, id=shop };
a=shop.lookup(b, c);

There is nothing to prevent the compiler from doing a compile-time type-safety
check: is the column col on the table t1 the same type as the variable z, or is
the column col1 the same type as c1? Something similar can be done for the
webservice. Is the second argument of the lookup method of the webservice the
same type as c?

The way that we arrive at the functionality requirements for these examples is
to allow the library writer to specify new syntactic constructs to simplify complex,
error-prone tasks; along with this syntax, there must be provided a semantic
description of the new construct. A perfectly viable way to specify semantics is as
YACC and ANTLR do: associating a block of procedural code with each production,
which will be run after that production is used.

We conclude with the observation that the capabilities of the hypothetical sys- 
tem under discussion in this section are not really new. Indeed, for the web service 
example at least, the capabilities are common. However, continuing with that ex- 
ample, the method used to achieve web service integration with the host language 
is by means of a WSDL compiler; one of our goals is to make obsolete artifacts 
like WSDL compilers. What is new is a single mechanism which integrates the
language with the program. 



1.2 Conventions 

We will observe some (fairly standard) typographical conventions to denote
the type of variables; see Table 1. We often deal with grammars in the sequel,
which have either 4 or 6 components; in most cases, we will label the components
of these grammars as $(\Sigma, N, P, S)$ and $(\Sigma, N, P, S, T, M)$, as appropriate.

    terminals                 $a, b, c, d, e, f$
    nonterminals              $A, B, C, D, E, F$
    terminal strings          $r, s, t, u, v, w, x, y, z$
    symbol strings            $\alpha, \beta, \gamma, \delta, \zeta, \eta, \theta, \kappa, \lambda$
    symbol                    $U, V, W, X, Y, Z$
    start symbol              $S, S'$
    production                $\pi, \tau$
    node                      A, B, C, D, E, F (in a distinct typeface)
    set of nodes              U, V, W, X, Y, Z (in a distinct typeface)
    transformation            $\Delta$
    set of transformations    V, T, D (in a distinct typeface)
    a grammar                 $G$

Table 1: Typographical conventions.

1.3 The Rest of This Document

There are 4 main sections in the sequel. In Section 2 we present the formal
definition of a transformative language. This definition is perhaps surprisingly
complex, but it allows us to establish Theorem 5, a result which states that parsing
sentences in an appropriate transformative language can be done in a finite amount
of time.

In Section 3 we present and justify the Algorithm to recognize sentences generated
by a given grammar. This Algorithm relies on determining whether a particular
object belongs to a particular set: the object is a transformation (i.e., a change of
grammar), and the set contains those transformations we term "valid." In Section 4
we present an Algorithm to test a transformation for membership in the
aforementioned set of valid transformations. We spend a great deal of time proving
that this Algorithm is correct.

Finally, in Section 5 we survey related work: the previously mentioned
"extensible languages," other investigations into parsers whose grammar
changes while parsing, frameworks for creating derived languages (usually derived
from Java), and macro systems are all discussed in that section. There are also a
few works very near to the present one.

2 Transformative LR(1) Languages

The LR($k$) class of languages, where $k \geq 0$, is the largest known class of languages
which can be parsed without backtracking. The theory of these languages, along
with an algorithm to determine that a grammar is LR($k$), is due to Knuth [21],
but we will primarily draw upon the presentation of this theory from [2]. We will
not concern ourselves with the more general case of LR($k$) languages for $k > 1$.
It shall turn out to be the case that LR(1) languages are convenient as a basis for
creating a practical transformative language.

We first review LR(1) parsing. We then try to extend the LR(1) parsing
algorithm in the most naive way possible. The first attempt we make will not be
successful, but it will illustrate a subtle problem in the development of
transformative LR(1) languages. Rectifying this problem will occupy us for much of the
rest of this work.

2.1 LR(1) Languages 

The LR(1) class of languages is a subset of the context-free languages. Context-free
languages are those that are generated by a context-free grammar, which is a
tuple $(\Sigma, N, P, S)$, where $\Sigma$ is the terminal alphabet, $N$ is the nonterminal alphabet,
$P$ is the set of productions, and $S$ is the start symbol. We establish the convention
that there is a special symbol $\dashv \in \Sigma$ that does not appear on the right side of
any production; this symbol is used to terminate strings in the language. We will
specify LR(1) languages by presenting a context-free grammar for that language;
given a context-free grammar, we cannot tell at first glance whether or not the
grammar is an LR(1) grammar; rather, we must utilize the tools of LR(1)
theory to make this determination.

We must clarify some notation we will have occasion to use. Let $G = (\Sigma, N, P, S)$
be a context-free grammar. Since "$\to$" is a binary relation, it is in fact a subset of
the cartesian product $N \times (\Sigma \cup N)^*$; this subset is $P$. If $A \to \alpha$ is a production,
then there is an ordered pair $(A, \alpha) \in P$. We have no problem with notation like
$\pi = (A, \alpha)$; we therefore ought to have no problem with notation like $\pi = A \to \alpha$.
We take the symbol $\Rightarrow$ to mean the replacement of the rightmost nonterminal in
a string, and we take $\Rightarrow^*$ to be a rightmost derivation of zero or more steps.

One way to define LR(1) languages is via Definition 3, which, along with
Definition 1, is from Chapter 5 of [2].

Definition 1. If $G = (\Sigma, N, P, S)$ is a context-free grammar, then for any
$\alpha \in (\Sigma \cup N)^*$, we define $\mathrm{FIRST}_k(\alpha)$ to be the set of all $y \in \Sigma^*$, where
$|y| = k$, such that $\alpha \Rightarrow^* yx$ for some $x \in \Sigma^*$. We understand $\mathrm{FIRST}(\alpha)$ to
mean $\mathrm{FIRST}_1(\alpha)$. We call $\mathrm{FIRST}(\alpha)$ the first set of $\alpha$.

Definition 2. Let $G = (\Sigma, N, P, S)$ be a context-free grammar, and let $S'$ be a
nonterminal not in $N$. Define the context-free grammar
$(\Sigma, N \cup \{S'\}, P \cup \{S' \to S\}, S')$ as the augmented grammar associated with $G$.

We are interested in augmented grammars because augmentation is an easy way of
ensuring that the start symbol does not appear on the right side of any production:
this is a necessary condition for the construction of LR($k$) parser tables.

Definition 3. Let $G = (N, \Sigma, P, S)$ be a CFG and let $G' = (N', \Sigma, P', S')$ be its
augmented grammar. We say that $G$ is LR($k$), $k \geq 0$, if the three conditions

1. $S' \Rightarrow^*_{G'} \alpha A w \Rightarrow_{G'} \alpha \beta w$,

2. $S' \Rightarrow^*_{G'} \gamma B x \Rightarrow_{G'} \alpha \beta y$, and

3. $\mathrm{FIRST}_k(w) = \mathrm{FIRST}_k(y)$

imply that $\alpha A y = \gamma B x$. (That is, $\alpha = \gamma$, $A = B$, and $x = y$.)

2.1.1 Shift-Reduce Parsing 

A shift-reduce parser is a deterministic pushdown automaton with a stack of binary
tuples, controlled by a parsing table, calculated from the language's context-free
grammar before parsing begins. Each tuple on the stack is in the set
$K \times (\Sigma \cup N \cup \{\epsilon\})$, where $K$ is a finite set of integers. The parser can do one of
three things:

1. it can remove the first symbol from the input string, and put it and a state 
onto the stack, a move which we will call a shift; 

2. it can remove zero or more tuples from the stack, and replace them with a
single new tuple, a move which we will call a reduction; or,

3. it can halt. 

Should the automaton be in an accepting state when it halts, then we know that
$x \in L(G)$, and we say that the parser accepts the string $x$. Should the parser halt
in any other state, then we know that $x \notin L(G)$, and we say that the parser rejects
the string $x$. The parser is in an accepting state if and only if it just reduced by the
production $S' \to S$.

The automaton examines the current input symbol, which we will call $a$, and
takes an action based upon the value of $a$ and the value of the integer in the tuple
on top of the stack, which we will call $k$; if action[$k$, $a$] = shift $m$, then we push
$(m, a)$ onto the stack and set $a$ to be the next input symbol; if action[$k$, $a$] =
reduce "$A \to \alpha$", then we pop $|\alpha|$ tuples off of the stack and, letting $(k', X)$ be the
tuple on the top of the stack after popping those items off, we push
$(\mathrm{goto}[k', A], A)$ onto the stack; if action[$k$, $a$] = error, then we halt in a
non-accepting state; finally, if action[$k$, $a$] = accept, then we halt in an accepting
state.
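
The loop just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables below were worked out by hand for the small grammar S -> a S b | (empty), the character "$" stands in for the end-of-input symbol, and the names ACTION, GOTO and accepts are ours rather than part of any existing tool. A reduce entry stores the left side of the production and the length of its right side.

```python
# Hand-built tables for the grammar  S -> a S b | (empty).
# "$" is an assumed stand-in for the end-of-input marker.
ACTION = {
    (0, "a"): ("shift", 2), (0, "$"): ("reduce", "S", 0),
    (1, "$"): ("accept",),
    (2, "a"): ("shift", 4), (2, "b"): ("reduce", "S", 0),
    (3, "b"): ("shift", 5),
    (4, "a"): ("shift", 4), (4, "b"): ("reduce", "S", 0),
    (5, "$"): ("reduce", "S", 3),
    (6, "b"): ("shift", 7),
    (7, "b"): ("reduce", "S", 3),
}
GOTO = {(0, "S"): 1, (2, "S"): 3, (4, "S"): 6}


def accepts(sentence):
    """Run the shift-reduce loop; return True iff the sentence is accepted."""
    tokens = list(sentence) + ["$"]
    stack = [(0, None)]              # (state, symbol) tuples; state 0 is initial
    pos = 0
    while True:
        k, a = stack[-1][0], tokens[pos]
        act = ACTION.get((k, a), ("error",))   # missing entries mean "error"
        if act[0] == "shift":
            stack.append((act[1], a))
            pos += 1                 # advance to the next input symbol
        elif act[0] == "reduce":
            _, head, body_len = act
            if body_len:
                del stack[-body_len:]          # pop one tuple per body symbol
            k_prime = stack[-1][0]
            stack.append((GOTO[(k_prime, head)], head))
        else:
            return act[0] == "accept"
```

With these tables, accepts("aabb") holds while accepts("aab") does not; all of the grammar-specific knowledge lives in the two dictionaries, which is why a change of grammar amounts to a change of tables.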

2.1.2 The Canonical Shift-Reduce Parser for an LR(1) Grammar 

As we just saw in Section 2.1.1, all of the decisions on how to parse a string are
deferred to the construction of the parsing tables. It is this construction we turn to
now.

Given a context-free grammar, there are many ways of constructing parsing
tables for a shift-reduce parser. Not every method will succeed for a given
context-free grammar, but if a grammar is LR(1), there is one method which is guaranteed
to work: this is the original method of Knuth [21], and we refer to the parser (the
tables used by the shift-reduce parser, specifically) produced by this method as the
canonical LR(1) parser for that grammar.

The previously cited source does present the algorithm for the construction of
LR(1) parsing tables [21], but in an indirect form; a more direct presentation is
O. Perhaps the most friendly presentation of this algorithm is in [1, chap. 4]. We
need only summarize the algorithm here, following the presentation from [36]. Let
$G = (\Sigma, N, P, S)$ be a context-free grammar. We begin by augmenting the grammar.
We then construct sets of LR(1) items; in general, these items are of the form
$[A \to \alpha \cdot \beta, a]$, where $A \to \alpha\beta$ is a production, and $a \in \Sigma$. Intuitively, we think of
an item as a memo to ourselves that we are trying to match the production $A \to \alpha\beta$,
we have so far matched the $\alpha$, and we expect to match $\beta$ later; the meaning of the
"$a$" is this: when we have matched $\alpha\beta$, we reduce by $A \to \alpha\beta$ if and only if the
lookahead is $a$. We begin our construction of the item sets with the initial item
$[S' \to \cdot S, \dashv]$; we let $I_0$ be the closed item set containing the initial item, where
we define an item set to be closed if for every item of the form $[A \to \alpha \cdot B\beta, a]$
in that set, such that $B \to \gamma$ is a production, we have that $[B \to \cdot\gamma, b]$ is in that
item set, for all $b$ in $\mathrm{FIRST}(\beta a)$ (see Definition 1). For item sets $I_k$ and $I_m$, and a
grammar symbol $X \in (\Sigma \cup N)$, we define the goto function on $I_k$ and $X$ to be the
closed item set $I_m$, which we write as $m = \mathrm{goto}[k, X]$, if $[A \to \alpha \cdot X\beta, a] \in I_k$ and
$[A \to \alpha X \cdot \beta, a] \in I_m$. Finally, we define the action function for an item set $I_k$
and a terminal $a$ in one of two ways: if there is an item $[A \to \alpha \cdot a\beta, b] \in I_k$, then
we let action[$k$, $a$] = shift $m$, where $m = \mathrm{goto}[k, a]$; otherwise, if there is an item
$[A \to \alpha\cdot, a] \in I_k$, then we let action[$k$, $a$] = reduce "$A \to \alpha$". The only exception
to this last rule is the item $[S' \to S\cdot, \dashv]$, the completed form of the initial item:
if we reduce by the production $S' \to S$, then we accept the string, or recognize that
the string is a member of the language.

We can now encode the functions goto and action into two tables, as suggested 
by the bracketed notation. For any entry on the action table corresponding to an 
undefined value of the action function, we give that entry the value of "error". 
These are the canonical LR(1) parsing tables. 
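
As a rough illustration of the construction summarized above, the following Python sketch computes closed item sets and fills in the two tables. It is a minimal sketch under our own assumptions, not the presentation of any cited source: productions are (head, body) pairs with the body a tuple of symbols, "$" stands in for the end marker, the augmenting start symbol defaults to the name "S'", and first_of, first_sets, closure and build_tables are names chosen here for illustration. State 0 is always the initial item set; the numbering of the remaining states is not fixed.

```python
from collections import deque

END = "$"  # assumed stand-in for the end-of-input marker


def first_of(symbols, first, nonterms):
    """FIRST_1 of a symbol string; "" records that the string can vanish."""
    out = set()
    for X in symbols:
        fx = first[X] if X in nonterms else {X}
        out |= fx - {""}
        if "" not in fx:
            return out
    return out | {""}


def first_sets(prods, nonterms):
    """Fixed-point computation of FIRST_1 for every nonterminal."""
    first = {A: set() for A in nonterms}
    changed = True
    while changed:
        changed = False
        for head, body in prods:
            new = first_of(body, first, nonterms)
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


def closure(items, prods, nonterms, first):
    """Close a set of items (head, body, dot, lookahead)."""
    items, work = set(items), list(items)
    while work:
        head, body, dot, la = work.pop()
        if dot < len(body) and body[dot] in nonterms:
            # Lookaheads for the new items are FIRST(beta a).
            for t in first_of(body[dot + 1:] + (la,), first, nonterms):
                for h, b in prods:
                    if h == body[dot] and (h, b, 0, t) not in items:
                        items.add((h, b, 0, t))
                        work.append((h, b, 0, t))
    return frozenset(items)


def build_tables(prods, nonterms, start, aug="S'"):
    """Return (action, goto); raise ValueError if an entry is defined twice."""
    prods = [(aug, (start,))] + list(prods)
    nonterms = set(nonterms) | {aug}
    first = first_sets(prods, nonterms)
    action, goto = {}, {}

    def set_action(key, value):
        if action.setdefault(key, value) != value:
            raise ValueError(f"conflict at {key}: grammar is not LR(1)")

    i0 = closure({(aug, (start,), 0, END)}, prods, nonterms, first)
    states, queue = {i0: 0}, deque([i0])
    while queue:
        item_set = queue.popleft()
        k = states[item_set]
        for h, b, d, la in item_set:
            if d == len(b):          # dot at the end: reduce (or accept)
                set_action((k, la), ("accept",) if h == aug else ("reduce", h, b))
        for X in {b[d] for h, b, d, la in item_set if d < len(b)}:
            moved = {(h, b, d + 1, la) for h, b, d, la in item_set
                     if d < len(b) and b[d] == X}
            target = closure(moved, prods, nonterms, first)
            if target not in states:
                states[target] = len(states)
                queue.append(target)
            if X in nonterms:
                goto[(k, X)] = states[target]
            else:
                set_action((k, X), ("shift", states[target]))
    return action, goto
```

Entries absent from the returned action dictionary play the role of "error". Raising on a doubly defined entry is one simple way to report that the input grammar is not LR(1).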

2.2 A Note on Parse Trees 

If we have a context-free grammar $G = (\Sigma, N, P, S)$ and we have some $x \in \Sigma^*$, then
we can prove that $x \in L(G)$ by supplying a derivation of the form

$$S \Rightarrow \alpha_1 \Rightarrow \alpha_2 \Rightarrow \cdots \Rightarrow \alpha_n = x. \tag{1}$$

Definition 4. If $T$ is a tree such that:

1. the root of $T$ is labeled $S$;

2. every interior node with label $X$ has children labeled $Y_1, Y_2, \ldots, Y_n$,
such that $X \to Y_1 Y_2 \cdots Y_n$ is a production;

3. every leaf is labeled with a terminal or $\epsilon$; and

4. the yield of $T$ (removing $\epsilon$'s) is $x$,

then we say that $T$ is a simple parse tree for $x$.

The existence of a simple parse tree for $x$ is a necessary and sufficient condition
for $x \in L(G)$.

Definition 5. Let $T$ be a tree whose nodes are labeled with terminals, productions
or $\epsilon$. Let $\mathcal{R}(T)$ be the root of $T$. For any node $N$, define

$$\mathcal{L}(N) = \begin{cases} a & \text{if } N \text{ is labeled with terminal } a \\ \pi & \text{if } N \text{ is labeled with production } \pi \\ \epsilon & \text{if } N \text{ is labeled with } \epsilon \end{cases}$$

and define

$$\mathcal{L}_h(N) = \begin{cases} a & \text{if } N \text{ is labeled with terminal } a \\ A & \text{if } N \text{ is labeled with production } A \to \alpha \\ \epsilon & \text{if } N \text{ is labeled with } \epsilon \end{cases}$$

If $N$ is an interior node labeled with the production $B \to \alpha$, whose children are
$M_1, M_2, \ldots, M_m$, then we define two more functions, $\mathcal{C}$ and $\mathcal{B}$, as follows:
define

$$\mathcal{C}(N) = \mathcal{L}_h(M_1) \mathcal{L}_h(M_2) \cdots \mathcal{L}_h(M_m),$$

and define

$$\mathcal{B}(N) = \alpha.$$

We can reformulate Definition 4 as follows.

Definition 6. If $T$ is a tree such that:

1. every interior node is labeled with a production;

2. if $N$ is an interior node, then $\mathcal{C}(N) = \mathcal{B}(N)$;

3. every leaf is labeled with a terminal or $\epsilon$; and

4. the yield of $T$ (removing $\epsilon$'s) is $x$,

then we say that $T$ is a parse tree for $x$.

We developed the nonstandard definition of a parse tree in Definition 6 because
we will have occasion, particularly in Section 3, to do nontrivial work on parse
trees that would be impossible with simple parse trees, as removing child nodes
from a node destroys information about that node.

Definition 7. Let $T$ be a tree and let $N$ and $M$ be nodes in $T$, such that $N$ is an
ancestor of $M$. Exactly one child of $N$, which we will call $A$, is either $M$ itself
or an ancestor of $M$. In either case, we say that $A$ is
autoancestral to $M$.

We finish with a note on ordering parse trees. A parse tree is inherently an
ordered tree. If nodes $N_1$ and $N_2$ share the same parent, then $N_1$ and $N_2$ are
comparable; call this partial order $<_t$. We will find it convenient to give parse trees the
following total order.

Definition 8. Let an ordered tree be given, with $A$ and $B$ nodes in that tree. Define
$A \leq B$ if any of the following are true:

1. $A = B$;

2. $B$ is a descendant of $A$; or

3. there exist distinct nodes $C$ and $D$ with a shared parent and $C <_t D$, such
that $C$ is autoancestral to $A$, and $D$ is autoancestral to $B$.

Our total ordering of the nodes in a tree suggests a method of diagramming
trees. We can illustrate this, along with some of the other ideas of this section,
with an example. Let $G = (\{a, b\}, \{S\}, P, S)$ be a context-free grammar, with
$P = \{S \to aSb \mid \epsilon\}$. Consider the derivation of the string $aabb$:

$$S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb.$$

We can represent this using the diagram in Figure 1; in that diagram, nodes appear
least to greatest from top to bottom.
S → a S b
├─ a
├─ S → a S b
│  ├─ a
│  ├─ S → ε
│  └─ b
└─ b

Figure 1: The parse tree for the string aabb.
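
The order of Definition 8 places a node before all of its descendants and places the whole subtree of a left sibling before that of a right sibling, so listing the nodes from least to greatest is a preorder walk. The Python sketch below builds the tree of Figure 1 and checks condition 2 of Definition 6 at each node. It is only an illustration: the names Node, is_parse_tree, preorder and show are ours, productions are represented as (head, body) pairs, and, following the figure, a node labeled with the empty production is given no children.

```python
class Node:
    """A tree node labeled with a terminal (str) or a production (head, body)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def head(self):
        # A terminal stands for itself; a production node contributes its left side.
        return self.label if isinstance(self.label, str) else self.label[0]


def is_parse_tree(node):
    """Check, at every production node, that the children's heads spell the body."""
    if isinstance(node.label, str):
        return not node.children          # terminals must be leaves
    child_heads = tuple(c.head() for c in node.children)
    return child_heads == node.label[1] and all(map(is_parse_tree, node.children))


def preorder(node):
    """Yield nodes from least to greatest in the total order on tree nodes."""
    yield node
    for child in node.children:
        yield from preorder(child)


def show(label):
    if isinstance(label, str):
        return label
    head, body = label
    return f"{head} -> {' '.join(body) or 'ε'}"


# The parse tree for "aabb" under the productions S -> a S b | ε.
S_ASB = ("S", ("a", "S", "b"))
S_EPS = ("S", ())
inner = Node(S_ASB, [Node("a"), Node(S_EPS), Node("b")])
tree = Node(S_ASB, [Node("a"), inner, Node("b")])

for n in preorder(tree):
    print(show(n.label))
```

The printed lines appear in the same top-to-bottom order as the rows of the diagram, and reading off only the terminal labels in that order recovers the yield.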

2.3 Grammars and Transformations 

When the grammar is changed during a parse, we will often want to change only 
a part of the grammar, rather than the entire grammar. Therefore, we will treat 
a change of grammar as the act of adding and removing productions from the 
grammar; we call this act a transformation of grammar. Only after reduction by 
certain productions will we add or remove productions from the grammar; we call 
these productions transformative productions. 

We can begin to see how the parsing algorithm will have to be modified to parse 
a transformative language: when the parser performs a reduction, it checks to see if 
the production was a transformative production; assuming that it was, we perform 
a transformation of the grammar and calculate the parsing tables for the new gram- 
mar. The parsing stack is completely dependent upon the particulars of the parsing 
tables; hence, when we stop to perform a transformation of grammar, we will en- 
deavor to construct a new stack which will allow parsing to continue from that 
point. Lemma|S|provides a sufficient condition for a new stack to be constructed;
we look at the exact conditions for constructing a new stack in Section 3.2.

A method of determining which productions to add to or remove from a gram- 
mar must be supplied with the grammar. These productions are encoded in some 
manner in the portion of the sentence that the parser has already scanned, and must 
be translated into a form the parser understands. Conceivably, the parser could dic- 
tate the manner of this encoding, but we will not allow the parser to do so. Rather, 
the grammar will include a mechanism for decoding the change to the grammar. 
This mechanism will take the form of a Turing machine that will take as input the 
portion of the input already scanned, and will, upon halting, contain upon one of 
its tapes an encoding of the productions to add to or remove from the grammar in 
a form that the parser understands; the contents of this tape encode the grammar 
transformation to apply to the grammar. We will let $G_t$ be a context-free grammar
with terminal set $\Sigma_t$; grammars and grammar transformations are encoded as
strings in $L(G_t)$.

We will generally use the symbol "$\Delta$", or some variation on it, to represent a
grammar transformation. Owing to this terminology, we will refer to the Turing
machine which produces the transformation as a $\Delta$-machine. An LR(1) grammar,
a source sentence, and a $\Delta$-machine together form the input for our modified
parsing algorithm. We will define these objects precisely.

Definition 9. Let $\Sigma$ be a terminal set. Let $M$ be a Turing machine with three
semi-infinite tapes that have tape alphabets $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$, respectively, where
$\Gamma_1, \Gamma_2 \supseteq \Sigma$ and $\Gamma_3 \supseteq \Sigma_t$. We will require that the machine does not output blanks
on the second and third tapes; this allows us to define $w \in \Gamma_2^*$ as the contents of the
second tape up to the first blank cell after the machine halts; similarly, we define
$z \in \Gamma_3^*$ for the contents of the third tape up to the first blank cell after the machine
halts. If we can guarantee that $M$ halts and that $w \in L(G_t)$, then we say that $M$ is
a $\Delta$-machine over $\Sigma^*$. We define the output of a $\Delta$-machine as $(w, \Delta)$, where $\Delta$ is
the transformation encoded in $z$.

Definition 10. A transformative context-free grammar (TCF grammar) is a tuple
$(\Sigma, N, P, S, T, M)$, where:

• $\Sigma$ is the terminal alphabet,

• $N$ is the nonterminal alphabet,

• $P$ is the production set,

• $S$ is the start symbol,

• $T$ is a subset of $P$ whose elements are the transformative productions of this
grammar, and

• $M$ is a $\Delta$-machine over $\Sigma^*$.

Definition 11. Let a TCF grammar $G = (\Sigma, N, P, S, T, M)$ be given, and let
$G_0 = (\Sigma, N, P, S)$ be the associated context-free grammar. Let $\alpha, \beta \in (\Sigma \cup N)^*$ be
such that

$$\alpha \Rightarrow^*_{G_0} \beta. \tag{2}$$

If no transformative productions are used in (2), then we say that (2) is a
nontransformative derivation, and we write $\alpha \Rightarrow^*_{G,\,\mathrm{nt}} \beta$. If only transformative
productions are used in (2), then we say that (2) is a transformative derivation,
and we write $\alpha \Rightarrow^*_{G,\,\mathrm{t}} \beta$. We take $\Rightarrow^*_{G}$ to mean $\Rightarrow^*_{G,\,\mathrm{nt}}$.

Definition 12. If $G = (\Sigma, N, P, S, T, M)$ is a TCF grammar, and $(\Sigma, N, P, S)$ is
LR(1), then we say that $G$ is a transformative LR(1) grammar (TLR grammar).

Definition 13. A grammar transformation for a TLR grammar $G = (\Sigma, N, P, S, T, M)$
is a tuple $(N', P_+, P_-)$, where:

• $N'$ is a set of nonterminals to add to $N$;

• $P_+$ is a set of productions to add to $P$, where $P \cap P_+ = \emptyset$; and

• $P_-$ is a set of productions to remove from $P$, where $P_- \subseteq P$.

This immediately implies that $P_-$ and $P_+$ are disjoint.
Given a grammar and a transformation, we define the symbol

$\Delta G = (\Sigma, N \cup N', (P \cup P_+) \setminus P_-, S, T, M)$.

For a given TLR grammar $G$, let $\mathcal{A}_G$ be the set of all grammar transformations $\Delta$
for $G$.
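The effect of a grammar transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, productions are modelled as (head, body) pairs, and the $\Delta$-machine component of the grammar tuple is omitted. The third check reflects that the set of transformative productions is a fixed component of the tuple and must remain inside the production set.

```python
from dataclasses import dataclass


# Hypothetical representation: a production is a pair (head, body),
# where body is a tuple of grammar symbols.
@dataclass(frozen=True)
class Grammar:
    terminals: frozenset
    nonterminals: frozenset
    productions: frozenset
    start: str
    transformative: frozenset  # the set T, a subset of productions


@dataclass(frozen=True)
class Transformation:
    new_nonterminals: frozenset  # N'
    added: frozenset             # P+
    removed: frozenset           # P-


def apply_transformation(g: Grammar, d: Transformation) -> Grammar:
    """Return the transformed grammar, enforcing the side conditions."""
    if g.productions & d.added:
        raise ValueError("P+ must be disjoint from P")
    if not d.removed <= g.productions:
        raise ValueError("P- must be a subset of P")
    if d.removed & g.transformative:
        raise ValueError("P- must not remove a transformative production")
    return Grammar(
        g.terminals,
        g.nonterminals | d.new_nonterminals,
        (g.productions | d.added) - d.removed,
        g.start,
        g.transformative,
    )


# The identity transformation (no nonterminals or productions change).
IDENTITY = Transformation(frozenset(), frozenset(), frozenset())
```

Using immutable sets keeps the old grammar available after a transformation, which is convenient when a parser wants to compare the grammar before and after.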

A couple of observations. Note that $T$ is constant; because $T \subseteq P$, we must
have that $P_- \cap T = \emptyset$. We note also that $\Delta_\epsilon = (\emptyset, \emptyset, \emptyset)$ is an identity.

2.4 TLR Languages

At what point should we allow the parser to stop and change the grammar? There 
are two options: after a shift, or after a reduction. We do not include the option of 
stopping before a shift because it is not substantively different from stopping after 
a shift; likewise, we do not include the option of stopping before a reduction. We 
will adopt the latter option: the parser will stop and change the grammar after a
reduction. 

As a basis for a method of parsing transformative languages, LR(1) parsing
would seem to be ideal: an LR(1) parser reads its input from left to right, as is
required for transformative languages; LR(1) parsing requires no backtracking; and
LR(1) parsing requires only a single symbol of lookahead.

An exact definition of transformative LR(1) languages requires a fair amount
of machinery. This will occupy us for the rest of this section. To see why this
work is required, we consider an example in the next section.

2.4.1 A Naive Approach to Transformative Languages

Let us attempt to define the transformative language generated by a TLR grammar
in the most obvious way, and see what goes wrong.

Definition 14. If $\alpha \in (\Sigma \cup N)^*$ is a viable prefix for $G$, and there is some sentential
form $\alpha a x$, where $a \in \Sigma$ and $x \in \Sigma^*$, then we say that $\alpha$ is a viable prefix followed
by $a$.

Definition 15. Let $G = (\Sigma, N, P, S, T, M)$ and $G' = (\Sigma, N', P', S, T, M)$ be two TLR
grammars. Let the alphabets for the three tapes of $M$ be $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$, respectively.
Let $g \in \Sigma_t^*$ be an encoding of $G$ in $L(G_t)$. Let $\alpha, \beta \in (\Sigma \cup N)^*$ and $x, z \in \Sigma^*$ be
given, and let $u, w \in \Gamma_2^*$ also be given. We say that $(\beta z, w, G')$ is a semiparse for
$(\alpha x, u, G)$, a relationship we denote with the symbol $(\alpha x, u, G) \rightsquigarrow (\beta z, w, G')$, in
either of two cases. The first case is that we have all of the following:

1. $\beta z = S$,

2. $\beta z \Rightarrow_{G_{nt}} \alpha x$,

3. $G' = G$.

The second case is that we have all of the following:

1. $\beta z = \beta' B z \Rightarrow_{G_t} \beta' \gamma z \Rightarrow_{G_{nt}} \alpha y z = \alpha x$, where $y \in \Sigma^*$ and $\gamma \in (\Sigma \cup N)^*$,

2. $\beta$ is a viable prefix followed by $b$ for $G'$, where $b = \mathrm{FIRST}(z)$,

3. the output of $M$ with input $(y, u, g)$ is $(w, \Delta)$ such that $G' = \Delta G$.

We could now try to define the language generated by a TLR grammar. Surely,
the language generated by a TLR grammar $G$ consists of those strings $x$ such that
we have:

$(x, \epsilon, G) \rightsquigarrow (\alpha_1, u_1, G_1) \rightsquigarrow (\alpha_2, u_2, G_2) \rightsquigarrow \cdots \rightsquigarrow (S, u_n, G_n)$.

The kind of problem this naive definition of transformative languages poses
can be illustrated by example. We let $G$ be the TLR grammar with
terminal alphabet $\{c, d\}$, nonterminal alphabet $\{S, A, B, C, D\}$, production set

$S \to A \mid B$, $A \to C$, $B \to D$, $C \to c$, $D \to d$,

start symbol $S$, transformative productions $A \to C$ and $B \to D$, and $\Delta$-machine $M$.
Consider the production sets

$S \to A$, $A \to C$, $B \to D$, $C \to B$, $D \to d$, (3)

and

$S \to B$, $A \to C$, $B \to D$, $C \to c$, $D \to A$. (4)

Let $G_1$ and $G_2$ be the grammars that are identical to $G$, only with the production
sets in (3) and (4), respectively. Let $\Delta_0$, $\Delta_1$, and $\Delta_2$ be grammar transformations
such that

$G_1 = \Delta_0 G$
$G_2 = \Delta_1 G_1$
$G_1 = \Delta_2 G_2$

We now let the $\Delta$-machine $M$ be such that: when the instantaneous grammar is
$G$, the return value of $M$ is $\Delta_0$; when the instantaneous grammar is $G_1$, the return
value of $M$ is $\Delta_1$; and finally, when the instantaneous grammar is $G_2$, the return
value of $M$ is $\Delta_2$.

We consider now how to parse the string "d" using our naive method:

$(d, \epsilon, G) \rightsquigarrow (B, u_1, G_1) \rightsquigarrow (A, u_2, G_2) \rightsquigarrow (B, u_3, G_1) \rightsquigarrow (A, u_4, G_2) \rightsquigarrow \cdots$

Since the value of $u_i$ does not affect the $\Delta$-machine for any $i$, we can see that
this sequence of semiparses has no end. This is not a theoretical difficulty, for by
Definition 10 we only require that the sequence of semiparses terminates; thus,
the string "d" is not generated by the grammar $G$.
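The cycle in the displayed sequence can be reproduced with a small simulation. The sketch below is not the paper's algorithm: it assumes the production sets read as $S \to A \mid B$, $A \to C$, $B \to D$, $C \to c$, $D \to d$ for $G$, with (3) and (4) as its two variants, and it relies on every production body here being a single symbol, so that a sentential form is one symbol and each reduction is forced. All names are hypothetical.

```python
# Each grammar is a set of (head, body) pairs; every body is one symbol.
G = {("S", "A"), ("S", "B"), ("A", "C"), ("B", "D"), ("C", "c"), ("D", "d")}
G1 = {("S", "A"), ("A", "C"), ("B", "D"), ("C", "B"), ("D", "d")}  # set (3)
G2 = {("S", "B"), ("A", "C"), ("B", "D"), ("C", "c"), ("D", "A")}  # set (4)
GRAMMARS = {"G": G, "G1": G1, "G2": G2}
TRANSFORMATIVE = {("A", "C"), ("B", "D")}
# The machine ignores its tapes and looks only at the current grammar.
NEXT_GRAMMAR = {"G": "G1", "G1": "G2", "G2": "G1"}


def naive_semiparses(symbol, start="S", max_steps=20):
    """Follow naive semiparses from a one-symbol sentential form.

    Returns (trace, outcome); outcome is "accepted", "stuck", "loop"
    (a configuration repeated, so the sequence never ends) or "step limit".
    """
    name, trace, seen = "G", [], set()
    for _ in range(max_steps):
        # Reduce until a transformative production has been reduced.
        while True:
            if symbol == start:
                trace.append((symbol, name))
                return trace, "accepted"
            heads = [h for (h, b) in GRAMMARS[name] if b == symbol]
            if len(heads) != 1:
                return trace, "stuck"
            production = (heads[0], symbol)
            symbol = heads[0]
            if production in TRANSFORMATIVE:
                break
        name = NEXT_GRAMMAR[name]          # apply the emitted transformation
        trace.append((symbol, name))
        if (symbol, name) in seen:
            return trace, "loop"
        seen.add((symbol, name))
    return trace, "step limit"
```

Calling `naive_semiparses("d")` visits the configurations (B, G1), (A, G2), and then (B, G1) again, at which point the simulation reports a loop instead of running forever.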

Immediately before every transformation of grammar, we have a sentential
form which can be derived from the start symbol in the nontransformative grammar
associated with the instantaneous grammar in a finite number of steps, say $n$
steps. The problem that arises in this example is that, after the transformation of
grammar, the sentential form is now derivable from the start symbol in more than
$n$ steps; in this case, we will say that the parse has been extended. Analyzing a
given $\Delta$-machine to answer the question of whether or not it will always produce a
grammar transformation that will extend the parse is not a task that we can expect
the parser to do; indeed, this question is undecidable.

What we can and will do is look for some property of $\Delta$ such that, should we
require that any transformation emitted by the $\Delta$-machine must have this property,
then as a result, the parse will not be extended, hence it can be completed in a finite
number of steps.

2.4.2 Allowable Transformations 

In this section, we will construct the test the parser can perform to determine if it 
will accept or reject a transformation emitted by a A-machine. In the next section, 
we shall see if this test does indeed perform as advertised. 

The basic idea of the test is to examine the grammar, examine the symbol
stack at the time the transformation is to be applied, and identify a certain set of
productions called conserved productions. The transformation will be considered
acceptable if, for each conserved production, there is a corresponding production
in the transformed grammar, such that these two productions share a head and a
certain prefix, which we refer to as the conserved portion of that production. The
function we define now determines if a production is in this set, and if so, how long
this prefix is.

For these definitions, we will take $G = (\Sigma, N, P, S, T, M)$ to be given.

Definition 16. Let the sentential form $\beta \in (\Sigma \cup N)^*$ be given, such that $\beta \notin \Sigma^*$; that
is, there is at least one nonterminal in $\beta$, which we call $B$. We thus write $\beta = \alpha B a x$,
where $\alpha \in (\Sigma \cup N)^*$, $a \in \Sigma$ and $x \in \Sigma^*$; we know that $a$ exists, for at the very least
the end-of-file marker appears after the last appearance of $B$. Let $y \in \Sigma^*$ be such
that $\alpha B a x \Rightarrow y$; we consider the parse tree $T$ for $y$. Let $\mathbf{B}$ and $\mathbf{A}$ be the nodes
in $T$ that correspond to the symbols $B$ and $a$ in $\alpha B a x$. Let $P$ be an interior node
with $n$ children; we define $N_{\beta,T}(P)$ in one of the following ways:

1. If $P$ is an ancestor of $\mathbf{B}$, but not an ancestor of $\mathbf{A}$, then let $N_{\beta,T}(P) = n + 1$.

2. If $P$ is an ancestor of $\mathbf{A}$, then one of the children of $P$ is autoancestral to $\mathbf{A}$.
Assuming the children of $P$ are $X_1, X_2, \ldots, X_n$, let $X_i$ be the node
autoancestral to $\mathbf{A}$; let $N_{\beta,T}(P) = i$.

3. If $P$ shares an ancestor with both $\mathbf{B}$ and $\mathbf{A}$, but is an ancestor of neither, and
$\mathbf{B} < P < \mathbf{A}$, then let $N_{\beta,T}(P) = n + 1$.

4. If none of the above conditions holds, then let $N_{\beta,T}(P) = -1$.

For every $\pi \in P$, we define

$$V_\beta(\pi) = \begin{cases} \max\{N_{\beta,T}(P) : \mathcal{P}(P) = \pi\} & \text{if there is some } P \text{ such that } \mathcal{P}(P) = \pi, \\ -1 & \text{if there is no } P \text{ such that } \mathcal{P}(P) = \pi. \end{cases}$$

We call $V_\beta$ the conservation function for $\beta$.

Proposition 1. Let the TLR grammar $G = (\Sigma, N, P, S, T, M)$, the production $\pi \in P$,
and the viable prefix $\beta$, followed by $a$, all be given. If $y, z \in \Sigma^*$ are such that both
$\beta a y$ and $\beta a z$ are sentential forms for $G$, then $V_{\beta a y}(\pi) = V_{\beta a z}(\pi)$.

As a result of Proposition 1 we can amend the definition of the conservation
function: if $\beta$ is a viable prefix followed by $a$, then the value of $V_{\beta a}(\pi)$ is the value
of $V_{\beta a x}(\pi)$ for any terminal string $x$ such that $\beta a x$ is a sentential form.

Definition 17. Let the sentential form $\beta \in (\Sigma \cup N)^*$ be given. If $V_\beta(\pi) = -1$, then
we say that $\pi$ is a free production for $\beta$. If $V_\beta(\pi) = i$, for $\pi$ equal to $A \to \gamma$, and
$0 < i \le |\gamma|$, then we define the first $i$ grammar symbols on the right side of $\pi$ to
be the conserved portion for $\beta$ of $\pi$. If $V_\beta(\pi) = i$, for $\pi$ equal to $A \to \gamma$, and
$i = |\gamma| + 1$, then we define $\pi$ to be an entirely conserved production for $\beta$. Entirely
conserved productions are also conserved.

Definition 18. Let the transformative grammar $G = (\Sigma, N, P, S, T, M)$, and the
sentential form $\beta$ be given such that $\beta \in (\Sigma \cup N)^*$, but $\beta \notin \Sigma^*$. Let $\phi$ be a production,
not necessarily in $P$, denoted $B \to Y_1 Y_2 \cdots Y_m$. Let $\pi$ be a conserved production
for $\beta$ in $P$, denoted $A \to X_1 X_2 \cdots X_n$. If $V_\beta(\pi) \le n$, for
$1 \le i \le V_\beta(\pi)$ we have that $X_i = Y_i$, and $A = B$, then we write $\pi \sim_\beta \phi$. If $V_\beta(\pi) = n + 1$
and $\pi = \phi$, then we again write $\pi \sim_\beta \phi$. If $P'$ is a set of productions with terminal
set $\Sigma$ and nonterminal set $N' \supseteq N$, such that there exists some $\phi \in P'$ for every
conserved production $\pi \in P$ satisfying $\pi \sim_\beta \phi$, then we say that $P'$ conserves $\beta$ for
$P$.

Definition 19. Let the transformative grammar $G = (\Sigma, N, P, S, T, M)$ and the
grammar transformation $\Delta \in \mathcal{A}_G$ be given. Let $\Delta G = (\Sigma, N_{\Delta G}, P_{\Delta G}, S, T, M)$.
Let $\alpha B$, a viable prefix followed by $a$, be given where $\alpha \in (\Sigma \cup N)^*$, $B \in N$, and
$a \in \Sigma$. If, for all $x \in \Sigma^*$ such that $\alpha B a x$ is a sentential form of $G$, we have that
$P_{\Delta G}$ conserves $\alpha B a x$ for $P$, then we say that $\Delta$ is valid for $\alpha B a$ in $G$. The set of
all transformations which are valid for $\alpha B a$ we denote as $\mathcal{V}_G(\alpha B a)$. We include
$\Delta_\epsilon = (\emptyset, \emptyset, \emptyset)$ in $\mathcal{V}_G(\alpha B a)$.
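Once the value of the conservation function is known for each production, the matching test between an old production and a candidate in the new production set is mechanical. The following Python sketch shows only that final step; it assumes the conservation values have already been computed from a parse tree and are supplied as a dictionary. The function names and the (head, body) representation are hypothetical.

```python
def matches(pi, phi, v):
    """Decide whether production phi is a counterpart of conserved production pi.

    pi and phi are (head, body) pairs with body a tuple of symbols, and
    v is the value of the conservation function for pi (1 <= v <= n + 1).
    """
    (a, xs), (b, ys) = pi, phi
    n = len(xs)
    if v == n + 1:
        # Entirely conserved: the production must survive unchanged.
        return pi == phi
    if 1 <= v <= n:
        # Same head, and the first v body symbols agree.
        return a == b and xs[:v] == ys[:v]
    raise ValueError("pi is not conserved: v must lie in 1..n+1")


def conserves(new_productions, conservation):
    """Check that every conserved production has a counterpart.

    conservation maps each old production to its conservation value,
    with -1 marking free productions (which impose no constraint).
    """
    return all(
        any(matches(pi, phi, v) for phi in new_productions)
        for pi, v in conservation.items()
        if v != -1
    )
```

For example, a production A -> x B y with conservation value 2 is matched by A -> x B z w, since only the prefix x B must be preserved, but not when its value is 4 (entirely conserved).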

To get a feel for how this test accomplishes our goal, we consider the manner in
which an LR(1) parser operates. The value on the top of the state stack corresponds
to an item set. The item set contains those productions that the parser might be able
to reduce. The parser's initial item set is the one containing the item $[S' \to \cdot S, \dashv]$
and zero or more other items of the form $[A \to \cdot\alpha, a]$; the parser is in the initial item
set only when it has scanned nothing. The parser can either shift or reduce. After
the parser shifts a terminal (for example, the terminal "a"), the parser goes into a
new state, related to the old, as follows: for every item of the form $[A \to \alpha \cdot a\beta, b]$
in the first state, there is an item of the form $[A \to \alpha a \cdot \beta, b]$ in the second state.
The parser does not stop considering an item just because that item does not have
"a" to the right of the dot; for each item $[A \to \gamma \cdot B\delta, c]$ that the parser is considering,
it looks for an item of the form

$[B \to \cdot\sigma, g]$; (5)

should the parser be able to find a sequence of items like (5), the last of which
has the dot to the left of the terminal "a", then the whole sequence of productions
remains under consideration.

After the parser reduces a production, say $C \to \delta$, it pops the appropriate
number of states off of the stack, after which "C" will be immediately to the right
of the dot in one of the items in the new current state; the parser will move to a
new state, in which $C$ is to the left of the dot.

So we can summarize the operation of the parser as follows: the parser
considers several productions in parallel; for each production, as it shifts terminals, it
either:

• advances along that production,

• records its position in that production, shifting its attention to those
productions with the appropriate nonterminal at their head, or

• drops that production.

This process continues until a reduction, at which time, one of the productions 
currently under consideration is selected. The parser then recalls its previous state. 
We want to view the parser stack as the parser's method of keeping track of those 
productions it is considering. In this light, requiring the transformation to be valid 
for the current viable prefix means that each production under consideration by 
the pre-transformation parser has a counterpart that is under consideration by the 
post-transformation parser — and the portion of the production that the parser has 
already matched remains unchanged. 
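The view of the parser as advancing along several productions in parallel can be made concrete with a small sketch. It models only the "advance" and "drop" behaviour on a set of LR(1) items; closure computation is deliberately left out, and the `Item` type and function names are hypothetical.

```python
from typing import NamedTuple


class Item(NamedTuple):
    head: str
    body: tuple       # right-hand side symbols
    dot: int          # number of body symbols already matched
    lookahead: str

    def next_symbol(self):
        """Symbol immediately right of the dot, or None at the end."""
        return self.body[self.dot] if self.dot < len(self.body) else None


def advance(items, symbol):
    """Kernel of the successor state after reading `symbol`.

    Items with `symbol` right of the dot advance; all others are dropped.
    """
    return {it._replace(dot=it.dot + 1) for it in items if it.next_symbol() == symbol}


def reducible(items, lookahead):
    """Items whose production is fully matched and allowed on this lookahead."""
    return {it for it in items if it.next_symbol() is None and it.lookahead == lookahead}
```

In these terms, the matched portion of an item is `body[:dot]`, which is the part of a production that a valid transformation must leave unchanged.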

The current input symbol is what causes the parser to select the production to
use in a reduction; since a grammar transformation takes place immediately after
a reduction, the parser will have already "seen" the current input symbol. Say the
transformative production just reduced has the nonterminal "B" at its head: we
know that the current input symbol (say it is "a") is in the follow set of $B$ because
one of the productions the parser had been considering before it matched the
transformative production (the production $\pi$, for instance) derived some string such
that the current input symbol $a$ follows the nonterminal $B$. Requiring the
transformation to be valid for the current viable prefix followed by the current input
symbol means that $\pi$ has a counterpart in the post-transformation grammar that
also derives some string such that the current input symbol $a$ follows the
nonterminal $B$, and this latter derivation differs only inasmuch as it affects parts of
the string strictly after the "a".

In other words: any part of the grammar that the parser was using right after 
the reduction by the transformative production must exist in the new grammar 
unchanged. Productions not in use, and unused suffixes of productions that were 
in use, can be modified freely, provided the grammar remains LR(1). 

2.4.3 The Language Generated by a TLR Grammar 

Since we are interested in those languages for which a parser can be constructed,
we found it necessary to restrict those transformations that will be acceptable for
a $\Delta$-machine to emit. Now that we have given a precise description of which
transformations will be allowed, we can define the concept of a transformative
LR(1) language. We begin with Definition 15, the definition of "semiparse". The
following definition is Definition 15 with the requirement that the transformation
be valid.

Definition 20. Let $G = (\Sigma, N, P, S, T, M)$ and $G' = (\Sigma, N', P', S, T, M)$ be two TLR
grammars. Let the alphabets for the three tapes of $M$ be $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$, respectively.
Let $g \in \Sigma_t^*$ be an encoding of $G$ in $L(G_t)$. Let $\alpha, \beta \in (\Sigma \cup N)^*$ and $x, z \in \Sigma^*$ be
given, and let $u, w \in \Gamma_2^*$ also be given. We say that $(\beta z, w, G')$ is a valid semiparse
for $(\alpha x, u, G)$, a relationship we denote with the symbol $(\alpha x, u, G) \rightarrowtail (\beta z, w, G')$,
in either of two cases. The first case is that we have all of the following:

1. $\beta z = S$,

2. $\beta z \Rightarrow_{G_{nt}} \alpha x$,

3. $G' = G$.

The second case is that we have all of the following:

1. $\beta z = \beta' B z \Rightarrow_{G_t} \beta' \gamma z \Rightarrow_{G_{nt}} \alpha y z = \alpha x$, where $y \in \Sigma^*$ and $\gamma \in (\Sigma \cup N)^*$;

2. $\beta$ is a viable prefix followed by $b$ for $G'$, where $b = \mathrm{FIRST}(z)$;

3. the output of $M$ with input $(y, u, g)$ is $(w, \Delta)$ such that:

(a) $G' = \Delta G$, and

(b) $\Delta \in \mathcal{V}_G(\beta b)$.

Earlier, we tried to define the language generated by a TLR grammar in the
obvious way using the "semiparse" relationship (Section 2.4.1), and we ran into a
computational problem. When we define the language generated by a TLR grammar,
we do so in much the same way we did before, except that we require the
semiparses to be valid.

Definition 21. Let $G_0 = (\Sigma, N, P, S, T, M)$, a TLR grammar, and $x \in \Sigma^*$ be given.
Let $w_0 = \epsilon$ and $\alpha_0 = x$. If there is some $k > 0$ such that

$(\alpha_0, w_0, G_0) \rightarrowtail (\alpha_1, w_1, G_1) \rightarrowtail (\alpha_2, w_2, G_2) \rightarrowtail \cdots \rightarrowtail (\alpha_k, w_k, G_k) = (S, w_k, G_k)$,

then $x \in L(G_0)$. The set of all such strings $x$ is the language generated by $G_0$.

Theorem 1. If $L \subseteq \Sigma^*$ is a recursively enumerable language, then there is a TLR
grammar that generates $L$.

Proof. We know that there is a Turing machine $T$ that recognizes $L$. We will
construct the TLR grammar $G = (\Sigma, N, P, S, T, M)$ that generates $L$. Let
$\Sigma = \{a_1, a_2, \ldots, a_n\}$, let $N = \{S, A, B\}$, let

$P = \{S \to A,\ A \to B \mid AB,\ B \to a_1 \mid a_2 \mid \cdots \mid a_n\}$,

and let $T = \{S \to A\}$. We now describe the operation of $M$. The input on the first
tape of $M$ will be some $x \in \Sigma^*$. If $x \in L$, then output $\Delta_\epsilon$; if $x \notin L$, then output
$\Delta = (\emptyset, \emptyset, P)$, that is, the transformation that removes all productions.

Clearly, if $x \in L$, then $x \in L(G)$, but if $x \notin L$, then $x \notin L(G)$ because the
transformation $\Delta$ is not valid. ■
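The construction in this proof can be sketched as follows. The sketch makes two simplifying assumptions that are not part of the proof: membership in the language is supplied as a total Python predicate (so the machine always halts), and a transformation is treated as valid exactly when it removes no productions, which is enough to separate the two transformations the machine can emit. All names are hypothetical.

```python
def make_productions(alphabet):
    """Productions S -> A, A -> B | A B, and B -> a for each terminal a."""
    productions = {("S", ("A",)), ("A", ("B",)), ("A", ("A", "B"))}
    productions |= {("B", (a,)) for a in alphabet}
    return frozenset(productions)


def make_delta_machine(in_language, productions):
    """Map the scanned string x to a transformation (N', P+, P-)."""
    def machine(x):
        if in_language(x):
            return (frozenset(), frozenset(), frozenset())   # identity
        return (frozenset(), frozenset(), productions)        # remove everything
    return machine


def accepts(x, alphabet, machine):
    """Simulate the parse: scan x, reduce S -> A, then consult the machine."""
    if not x or any(ch not in alphabet for ch in x):
        return False            # the grammar derives only nonempty strings
    _, _, removed = machine(x)
    return not removed          # simplified validity check (an assumption)
```

Because the only transformative production is S -> A, the machine is consulted exactly once, after the whole input has been scanned.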

2.5 A Fundamental Theorem of TLR Parsing 

We saw before that the basic problem with the naive approach to defining the 
language generated by a TLR grammar is that the derivation of the string on the 
symbol stack can be extended. In the last section, we defined the language gener- 
ated by a TLR grammar using the valid semiparse relation. In this section, we will 
prove that requiring valid transformations prevents the extension of the string on 
the symbol stack. This is essential to proving that the TLR parsing algorithm, the 
subject of the next section, is correct. 

We begin with some technical Lemmas, leading up to the main Lemma of this 
section: Lemma|S| The title theorem is Theorem|5| 

Lemma 1. Let $G$ be a TLR grammar and let $\alpha A$ be a viable prefix followed by $c$.
If $\Delta \in \mathcal{V}_G(\alpha A c)$, and

$\beta B c \Rightarrow_G \alpha A c$,

then $\Delta \in \mathcal{V}_G(\beta B c)$.

Proof. By contradiction. Let $\Delta G = (\Sigma, N_{\Delta G}, P_{\Delta G}, S, T, M)$. Assume that
$\Delta \notin \mathcal{V}_G(\beta B c)$. Thus, there is some $x \in \Sigma^*$ such that $\beta B c x$ is a sentential form, and
yet $\Delta$ is not valid for $\beta B c x$ in $G$; there is some production $\pi$ that is conserved for
$\beta B c x$, where $\pi$ appears in the derivation

$S \Rightarrow \beta B c x$, (6)

such that for no $\phi \in P_{\Delta G}$ do we have $\pi \sim_{\beta B c x} \phi$.

Isn't $\pi$ conserved for $\alpha A c x$? Let $y \in \Sigma^*$ be such that

$S \Rightarrow_G \alpha A c x \Rightarrow_G y$;

use the parse tree $T_y$ for $y$. Let the nodes corresponding to $B$ and $c$ in (6) be $\mathbf{B}$ and
$\mathbf{C}$, respectively. Let $U$ be the set of all nodes $P$ such that either:

1. $P$ is an ancestor of $\mathbf{B}$, but is not an ancestor of $\mathbf{C}$;

2. $P$ is an ancestor of $\mathbf{C}$; or

3. $P$ shares an ancestor with $\mathbf{B}$ and $\mathbf{C}$, but is an ancestor of neither, such that
$\mathbf{B} < P < \mathbf{C}$.

These three possibilities correspond to the first three possibilities in Definition 16.
Note that

$S \Rightarrow_G \beta B c x \Rightarrow_G \alpha A c x \Rightarrow_G y$. (7)

Let $\mathbf{A}$ be the node corresponding to $A$ in the derivation (7). Let $W$ be the set of all
nodes $Q$ such that either:

1. $Q$ is an ancestor of $\mathbf{A}$, but is not an ancestor of $\mathbf{C}$;

2. $Q$ is an ancestor of $\mathbf{C}$; or

3. $Q$ shares an ancestor with $\mathbf{A}$ and $\mathbf{C}$, but is an ancestor of neither, such that
$\mathbf{A} < Q < \mathbf{C}$.

Let $P \in U$. Either $\mathbf{B}$ is an ancestor of $\mathbf{A}$, or it shares an ancestor, which we will
call $\mathbf{T}$, with both $\mathbf{A}$ and $\mathbf{C}$. We go through the three possibilities for $P$.

1. $P$ is an ancestor of $\mathbf{B}$ but not $\mathbf{C}$. There are two ways that this can arise.

(a) $\mathbf{B}$ is an ancestor of $\mathbf{A}$. This means that $P$ is an ancestor of $\mathbf{A}$ but not $\mathbf{C}$,
so $P \in W$. Thus, $\mathcal{P}(P)$ is entirely conserved for $\alpha A c x$.

(b) $\mathbf{B}$ shares $\mathbf{T}$ as an ancestor with $\mathbf{A}$ and $\mathbf{C}$, but is an ancestor of neither;
thus, $P$ shares $\mathbf{T}$ as an ancestor with $\mathbf{A}$ and $\mathbf{C}$, but is an ancestor of
neither, so $P \in W$, which means that $\mathcal{P}(P)$ is an entirely conserved
production for $\alpha A c x$.

Either way, $N_{\alpha A c x, T_y}(P) = N_{\beta B c x, T_y}(P)$.

2. $P$ is an ancestor of $\mathbf{C}$, in which case $P \in W$. Thus, $N_{\alpha A c x, T_y}(P) = N_{\beta B c x, T_y}(P)$.

3. $P$ shares an ancestor $R$ with $\mathbf{B}$ and $\mathbf{C}$, but is an ancestor of neither. There are
two ways that this can arise.

(a) $\mathbf{B}$ is an ancestor of $\mathbf{A}$. This means that $R$ is an ancestor of $\mathbf{B}$ and $\mathbf{C}$, so
$P \in W$. Therefore, $\mathcal{P}(P)$ is an entirely conserved production.

(b) $\mathbf{B}$ shares $\mathbf{T}$ as an ancestor with $\mathbf{A}$ and $\mathbf{C}$. This means that $\mathbf{T}$ is an ancestor
of $\mathbf{A}$, $P$, and $\mathbf{C}$. Therefore, we have that $\mathcal{P}(P)$ is an entirely conserved
production.

Either way, $N_{\alpha A c x, T_y}(P) = N_{\beta B c x, T_y}(P)$.

We have here established that $U \subseteq W$, and that, for all $P \in U$ such that
$N_{\beta B c x, T_y}(P) > 0$, we have that $N_{\alpha A c x, T_y}(P) = N_{\beta B c x, T_y}(P)$. It is still possible that
$N_{\alpha A c x, T_y}(P) > 0$ for a node $P \notin U$. Thus we have the following inequality:

$N_{\beta B c x, T_y}(P) \le N_{\alpha A c x, T_y}(P)$. (8)

The fact that $P \in U$, together with (8), lets us conclude that, for any production
$\pi$ that is conserved for $\beta B c x$, we have that

$V_{\beta B c x}(\pi) \le V_{\alpha A c x}(\pi)$. (9)

Since we know that $\Delta \in \mathcal{V}_G(\alpha A c)$, there is some $\phi \in P_{\Delta G}$ such that $\pi \sim_{\alpha A c x} \phi$. By
(9), we conclude that $\pi \sim_{\beta B c x} \phi$. ■

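Lemma 2. If

$\alpha C \Rightarrow_G \alpha\beta A z \Rightarrow_G \alpha\beta\gamma B y z$

and $\Delta \in \mathcal{V}_G(\alpha\beta\gamma B y)$, then

$\alpha C \Rightarrow_{\Delta G} \alpha\beta A x$;

furthermore, $\mathrm{FIRST}(y) = \mathrm{FIRST}(w)$.

Proof. Assume that $\alpha C \Rightarrow_G \alpha\beta A z \Rightarrow_G \alpha\beta\gamma B y z$.

We proceed by induction on the number of steps in the derivation $\alpha C \Rightarrow_G \alpha\beta A z$.
If $\alpha C \Rightarrow_G \alpha\beta A z$ in one step, then there is a production $C \to \beta A z$. Therefore, there is
a production $C \to \beta A \delta$ in $\Delta G$; for any $x \in \Sigma^*$ such that $\delta \Rightarrow_{\Delta G} x$, we find that

$\alpha C \Rightarrow_{\Delta G} \alpha\beta A x$.

Assume this result for all derivations less than $n$ steps long, for $n > 1$. Assume
that $\alpha C \Rightarrow_G \alpha\beta A z$ in $n$ steps. We write

$\alpha C \Rightarrow_G \alpha\zeta D \eta \Rightarrow_G \alpha\zeta D v \Rightarrow_G \alpha\zeta\theta A u v = \alpha\beta A z$.

The derivation $D \Rightarrow_G \theta A u$ is fewer than $n$ steps long, hence the induction hypothesis
implies that

$\alpha' D \Rightarrow_{\Delta G} \alpha' \theta A s$,

where $\alpha' = \alpha\zeta$. Since $\Delta \in \mathcal{V}_G(\alpha\beta\gamma B y)$, there must be a production $C \to \zeta D \kappa$ in
$\Delta G$; therefore, for any $t \in \Sigma^*$ such that $\kappa \Rightarrow_{\Delta G} t$, we have

$\alpha C \Rightarrow_{\Delta G} \alpha\zeta D \kappa \Rightarrow_{\Delta G} \alpha\zeta D t \Rightarrow_{\Delta G} \alpha\zeta\theta A s t = \alpha\beta A x$. ■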



Lemma 3. If

$\sigma A \gamma a \Rightarrow_G \sigma A a$, (10)

and $\Delta \in \mathcal{V}_G(\sigma A a)$, then

$\sigma A \gamma a \Rightarrow_{\Delta G} \sigma A a$.

Proof. By induction on the number of steps in the first derivation (10). If
$\sigma A \gamma a \Rightarrow_G \sigma A a$ in one step, then there must be a production $B \to \epsilon$ in $G$, where
$\gamma = B$. Evidently, this production is conserved.

Assume the result for derivations $n \ge 1$ steps long. If there are $n + 1$ steps,
then let $\gamma = \delta C$. We know that $\gamma$ is composed entirely of nonterminals, for any
terminals in $\gamma$ would remain between $A$ and $a$ in the final string of the derivation.
We have a production (possibly with $\epsilon$ on the right-hand side) $C \to \zeta$ in $G$, which
is entirely conserved. Thus

$\sigma A \delta C a \Rightarrow_G \sigma A \delta \zeta a \Rightarrow_G \sigma A a$.

We can use the induction hypothesis to establish that

$\sigma A \delta \zeta a \Rightarrow_{\Delta G} \sigma A a$;

since $C \to \zeta$ is also a production in $\Delta G$, we have our result. ■

Lemma 4. If $G$ is a TLR grammar, and

$\alpha A a \Rightarrow_G \beta B a$, (11)

and $\Delta \in \mathcal{V}_G(\beta B a)$, then

$\alpha A a \Rightarrow_{\Delta G} \beta B a$.

Proof. By induction. If $\alpha A a \Rightarrow_G \beta B a$ in one step, then there must be some production
$A \to \gamma B$ in $G$ such that $\alpha\gamma = \beta$. In any case, this is a conserved production, so this
production is also in $\Delta G$.

Let us assume this Lemma for derivations of length $n \ge 1$ and that there are
$n + 1$ steps in (11). We can rewrite derivation (11) as

$\alpha A a \Rightarrow_G \alpha\delta C \zeta a \Rightarrow_G \alpha\delta C a \Rightarrow_G \alpha\delta\theta B a = \beta B a$.

We can use Lemma 3 to establish that

$\alpha\delta C \zeta a \Rightarrow_{\Delta G} \alpha\delta C a$.

By Case 3 of Definition 16 we see that the production $A \to \delta C \zeta$ is entirely
conserved; thus, $\alpha A a \Rightarrow_{\Delta G} \alpha\delta C \zeta a$. By the induction hypothesis, we have that

$\alpha\delta C a \Rightarrow_{\Delta G} \beta B a$;

therefore,

$\alpha A a \Rightarrow_{\Delta G} \beta B a$.



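Lemma 5. Let $G$ be a TLR grammar. If

$\sigma A B \Rightarrow_G \sigma A a x$, (12)

then

$\sigma A B \Rightarrow_{\Delta G} \sigma A a y$,

for some $y \in \Sigma^*$, provided that $\Delta \in \mathcal{V}_G(\sigma A a)$.

Proof. By induction on the number of steps in (12). If $\sigma A B \Rightarrow_G \sigma A a x$ in one step, then there
is a production $B \to a x$ in $G$; the conserved portion of this production includes at
least the $a$, therefore, there is a production $B \to a\beta$ in $\Delta G$. For any $z \in \Sigma^*$ such that
$\beta \Rightarrow_{\Delta G} z$, we thus have

$\sigma A B \Rightarrow_{\Delta G} \sigma A a \beta \Rightarrow_{\Delta G} \sigma A a z$.

Assume the result for all derivations of length not greater than $n$, for some
$n \ge 1$. If there are $n + 1$ steps, then let the first production used be $B \to \beta$.

If the $a$ appears on the right side of this production (that is, $\beta = \gamma a \delta$), then we
have

$\sigma A B \Rightarrow_G \sigma A \gamma a \delta \Rightarrow_G \sigma A \gamma a x \Rightarrow_G \sigma A a x$;

we can use the preceding Lemma to establish that

$\sigma A \gamma a x \Rightarrow_{\Delta G} \sigma A a x$. (13)

Since the conserved portion of $B \to \gamma a \delta$ is at least $\gamma a$, we must have a production
$B \to \gamma a \zeta$ in $\Delta G$. Thus, for any $w$ such that $\zeta \Rightarrow_{\Delta G} w$, we have

$\sigma A B \Rightarrow_{\Delta G} \sigma A \gamma a \zeta \Rightarrow_{\Delta G} \sigma A \gamma a w \Rightarrow_{\Delta G} \sigma A a w$,

in light of (13).

If, however, $a$ does not appear on the right side of the production $B \to \beta$, then
there must be some nonterminal $C$ in the conserved portion of the right-hand side
of this production deriving the $a$. That is, there is a production $B \to \eta C \delta$ in $G$,
such that $\eta C \Rightarrow_G a t$ for some $t \in \Sigma^*$. If $\eta = \epsilon$, then note that

$\sigma A B \Rightarrow_G \sigma A C \delta \Rightarrow_G \sigma A C s \Rightarrow_G \sigma A a t s = \sigma A a x$;

we can apply the induction hypothesis to the derivation $\sigma A C \Rightarrow_G \sigma A a t$; in this case
we have

$\sigma A B \Rightarrow_{\Delta G} \sigma A C \delta' \Rightarrow_{\Delta G} \sigma A C s' \Rightarrow_{\Delta G} \sigma A a t' s'$,

for appropriate $t', s' \in \Sigma^*$. Now assume that $\eta \ne \epsilon$. Note that

$\sigma A B \Rightarrow_G \sigma A \eta C \delta \Rightarrow_G \sigma A \eta C u \Rightarrow_G \sigma A \eta a v u \Rightarrow_G \sigma A a v u = \sigma A a x$,

for appropriate $u, v \in \Sigma^*$. As $\eta$ is a nonterminal string, let us write $\eta = \eta' E$. Note
that

$\sigma A \eta' E C u \Rightarrow_G \sigma A \eta' E a v u$;

in other words,

$\sigma' E C \Rightarrow_G \sigma' E a v$, (14)

where $\sigma' = \sigma A \eta'$. By Lemma 1, $\Delta \in \mathcal{V}_G(\sigma' E a)$; thus, for every production $\pi$ in
derivation (14), there is some production $\phi$ such that $\pi \sim \phi$. Also, derivation (14)
is not more than $n$ steps long. Therefore, we may use the induction hypothesis to
yield the derivation

$\sigma' E C \Rightarrow_{\Delta G} \sigma' E a v'$.

From the previous Lemma, we get

$\sigma A B \Rightarrow_{\Delta G} \sigma A \eta' E C u' \Rightarrow_{\Delta G} \sigma A \eta' E a v' u' \Rightarrow_{\Delta G} \sigma A a v' u'$. ■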

Definition 22. If $G = (\Sigma, N, P, S)$ is an LR(1) grammar and $\beta$ is a viable prefix
followed by $a$, then we say that the parser for $G$ will shift $a$ after $n$ reductions of
$\beta$ if there is a sequence $\beta_1, \beta_2, \ldots, \beta_n \in (\Sigma \cup N)^*$ such that

$\beta_n \Rightarrow \cdots \Rightarrow \beta_2 \Rightarrow \beta_1 \Rightarrow \beta_0 = \beta$,

and $\beta_n a$ is a viable prefix, but for no $0 \le i < n$ is it the case that $\beta_i a$ is a viable
prefix.

Sometimes it will be useful to use the "converse" of the =^ symbol. 

Definition 23. If $G = (\Sigma, N, P, S)$ is a context-free grammar, and for some
$\alpha, \beta \in (\Sigma \cup N)^*$, we have that $\alpha \Rightarrow \beta$, we say that $\beta$ reduces to $\alpha$, and we write $\beta \Mapsto \alpha$.
Similarly, if $\alpha \Rightarrow^* \beta$, then we write $\beta \Mapsto^* \alpha$.
If $G$ is a transformative grammar, then we give the symbols $\Mapsto_{G_t}$ and $\Mapsto_{G_{nt}}$ the
obvious meanings as the converses of the symbols $\Rightarrow_{G_t}$ and $\Rightarrow_{G_{nt}}$, respectively.

Lemma 6. Let $G$ be a transformative grammar and let $\alpha A$ be a viable prefix
followed by $a$. If $\Delta \in \mathcal{V}_G(\alpha A a)$, then $\alpha A$ is a viable prefix followed by $a$ in $\Delta G$,
assuming $\Delta G$ is TLR.

Proof. There exists some $x \in \Sigma^*$ such that $\alpha A a x$ is a sentential form in $G$. We
will show that there is some $w \in \Sigma^*$ such that $\alpha A a w$ is a sentential form for $\Delta G$.
Let $t \in \Sigma^*$ be such that $\alpha A a x \Rightarrow t$, and consider the parse tree for $t$; there must
be one node $P$ that is an ancestor of both the node representing $A$ and the node
representing $a$, such that the child of $P$ that is autoancestral to the node representing
$A$ is different from the child of $P$ that is autoancestral to the node representing $a$;
let these two children of $P$ be labeled $C$ and $X$, respectively. Let the production
corresponding to $P$ be $P \to \beta C \gamma X \delta$. Note that

$X \Rightarrow a y$; (15)

$\gamma \Rightarrow \epsilon$; and (16)

$C \Rightarrow \eta A$. (17)

We use $X$ to remind us of the possibility that $X = a$; that is, we could have $X$ be
either a terminal or a nonterminal.

The derivations in (15), (16), and (17) appear in the derivation for $\alpha A a x$ in
sequence. Immediately before we begin deriving according to (15), there is a
sentential form $\theta P z$. We therefore have that $\theta\beta\eta = \alpha$ and $y z = x$.

In light of (17), we have that $C \Rightarrow_{\Delta G} \eta A$ by Lemma 4.

In light of (16), we have that $\gamma \Rightarrow_{\Delta G} \epsilon$ by Lemma 3.

From the assumption that $\Delta \in \mathcal{V}_G(\alpha A a)$, we can conclude that there is a
production $P \to \beta C \gamma X \zeta$ in $\Delta G$.

We have two more things to establish: that $X \Rightarrow_{\Delta G} a u$ for some $u \in \Sigma^*$; and that
there is a sentential form $\theta P v$ for some $v \in \Sigma^*$.
Since

$S \Rightarrow \theta P z \Rightarrow \theta\beta\eta A a x$,

we have, by Lemma 5, that

$S \Rightarrow_{\Delta G} \theta P v$.

Therefore, we turn to (15). If $X = a$, then we are done. So assume that $X$ is
a nonterminal; let $X = D$. We know that $C\gamma$ is a nonterminal string, so let
$C\gamma = \Lambda E$. By Lemma 1 and Lemma 5, we see that $\Delta \in \mathcal{V}_G(\theta\beta\Lambda E a)$. As

$\theta\beta\Lambda E D \Rightarrow \theta\beta\Lambda E a y$,

we therefore have

$\theta\beta\Lambda E D \Rightarrow_{\Delta G} \theta\beta\Lambda E a w$. ■

Theorem 2. Let $G$ be TLR, let $B \to \alpha$ be a production in $G$, and let $\Delta \in \mathcal{V}_G(\gamma B a)$.
If $\gamma\alpha$ is a viable prefix followed by $a$ in $G$, such that $a$ will be shifted after $n$
reductions of $\gamma\alpha$ in $G$, then $a$ will be shifted after $n - 1$ reductions of $\gamma B$ in $\Delta G$.

Proof. By induction on $n$. If $n = 1$, then the only reduction possible is $\gamma\alpha \Mapsto \gamma B$.
Since $a$ can be shifted, we know that $\gamma B a$ is a viable prefix for $G$; by Lemma 6 we
see that $\gamma B a$ is a viable prefix for $\Delta G$.

Assume the conclusion for some $k \ge 1$, and assume that $k = n + 1$. Now,

$\gamma\alpha \Mapsto \gamma B \Mapsto^* \delta$.

There are $n$ steps in the reduction $\gamma B \Mapsto^* \delta$; we can thus write

$\gamma B = \sigma\eta B \Mapsto \sigma C \Mapsto^* \delta$.

Note that

$\sigma C a \Rightarrow \gamma B a$,

therefore, by Lemma 1 we have $\Delta \in \mathcal{V}_G(\sigma C a)$. By the induction hypothesis, we
see that the $a$ can be shifted after $n - 1$ reductions of $\sigma\eta B$ in $\Delta G$. ■

3 TLR Algorithms 

Algorithm 1 (TLR Parsing Algorithm). Parse the string $x$ using the LR parsing
algorithm with the parsing table for the context-free grammar associated with $G$ until
such a time as a transformative production from $G$ is reduced; at this time, apply
a transformation to the grammar, recalculate the parsing tables, and then continue
parsing with the new grammar.

Input $G$: a TLR grammar, where $G = (\Sigma, N, P, S, T, M)$; $x$: a string in $\Sigma^*$

Output a boolean which is true only if $x \in L(G)$

Method

1. Let $w_\Delta = \epsilon$ and $z_\Delta = \epsilon$.

2. Calculate the parse table for $G$.

3. Push $(0, \epsilon)$ onto the stack.

4. Set $a$ to the first terminal of $x$.

5. Set $s$ to be the state on the top of the state stack.

6. If action[$s$, $a$] = shift, and $a = \dashv$, then return true.

7. Otherwise, if action[$s$, $a$] = shift, then do the following:

(a) Push (goto[$s$, $a$], $a$) onto the symbol stack.

(b) Set $w_\Delta = w_\Delta a$.

(c) Set $a$ to the next input symbol.

(d) Go to Step 5.

8. Otherwise, if action[$s$, $a$] = reduce $\pi$, then do the following (letting $\pi = A \to \alpha$):

(a) Pop $|\alpha|$ items off the stack.

(b) If $\pi \in T$, then execute the Grammar Transformation Algorithm
(Algorithm 2); set $G$ and the stack to the returned values.

(c) Set $w_\Delta = \epsilon$.

(d) Set $s'$ to be the state on the top of the state stack.

(e) Push (goto[$s'$, $A$], $A$) onto the symbol stack.

(f) Go to Step 5.

9. Otherwise, if action[$s$, $a$] = error, then return false.



This algorithm closely follows the presentation of the LR parsing algorithm found
in Section 4.7 of [1]. Indeed, the only essential difference is in Step 8b. We turn now
to the previously referenced Grammar Transformation Algorithm.

Algorithm 2 (Grammar Transformation Algorithm). Compute the new grammar
and its parsing tables. Assuming the transformation is valid, put the parser into
the correct state to continue parsing.

Input $G$: a TLR grammar, where $G = (\Sigma, N, P, S, T, M)$; $\sigma$: a parsing stack; $w_\Delta$:
a string in $\Sigma^*$; $z_\Delta$: a string in $\Gamma_2^*$

Output $G$: a TLR grammar; $\sigma$: a parser stack; $z_\Delta$: a string in $\Gamma_2^*$

Method

1. Execute the $\Delta$-machine with $(w_\Delta, z_\Delta, G)$ as input, and $(w_\Delta, z_\Delta, \Delta)$ as output.

2. Assert that $\Delta \in \mathcal{V}_G(\sigma)$.

3. Set $G = \Delta G$.

4. Calculate the parsing tables for $G$.

5. Pop $|\sigma|$ states off of the stack.

6. Do the following until the stack is empty:

(a) Let the top item on the stack be $(s, X)$.

(b) Set $\alpha' = X\alpha'$.

(c) Pop the top item off of the stack.

7. Push $(0, \epsilon)$ onto the stack.

8. Do the following, for $i$ from 1 to $|\alpha'|$:

(a) Let $X$ be the $i$-th symbol of $\alpha'$.

(b) Let $s$ be the state on the top of the stack.

(c) Push (goto[$s$, $X$], $X$) onto the stack.

9. Return the new values for $G$, $\sigma$ and $z_\Delta$. ■

If goto[$s$, $X$] in Step 8c were ever undefined, then the Grammar Transformation
Algorithm fails, which will cause the TLR Parsing Algorithm to fail as well.
However, in light of Lemma 6, we can be sure that the Grammar Transformation
Algorithm will fail at Step 2 first. It is straightforward to give a useful (to a human)
error message in this case.
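The interplay of the two algorithms can be sketched as a table-driven loop with a hook for grammar changes. This is a simplified illustration, not the algorithms as stated: the tables are passed in rather than computed, `goto` is assumed to be defined for terminals as well as nonterminals, a single stack of (state, symbol) pairs is used, the scanned string is reset only after a transformation, and the hypothetical `transform` callback stands in for running the $\Delta$-machine, checking validity, and recomputing tables (returning `None` to reject).

```python
def rebuild(stack, goto):
    """Replay the symbols on the stack through the new goto table."""
    new_stack = [(0, None)]
    for _, symbol in stack[1:]:
        target = goto.get((new_stack[-1][0], symbol))
        if target is None:
            return None                      # symbol stack not viable any more
        new_stack.append((target, symbol))
    return new_stack


def tlr_parse(tokens, action, goto, transformative, transform, end="$"):
    """Return True if the token sequence is accepted.

    action maps (state, terminal) to "shift" or ("reduce", (head, body));
    a missing entry is an error.
    """
    stack = [(0, None)]                      # pairs (state, symbol)
    scanned = []                             # terminals since the last transformation
    tokens = list(tokens) + [end]
    pos = 0
    while True:
        state, a = stack[-1][0], tokens[pos]
        act = action.get((state, a))
        if act == "shift":
            if a == end:
                return True                  # shifting the end marker accepts
            stack.append((goto[(state, a)], a))
            scanned.append(a)
            pos += 1
        elif act is not None and act[0] == "reduce":
            head, body = act[1]
            if body:
                del stack[-len(body):]
            if (head, body) in transformative:
                result = transform(scanned, action, goto)
                if result is None:
                    return False             # transformation rejected
                action, goto = result
                stack = rebuild(stack, goto)
                if stack is None:
                    return False
                scanned = []
            target = goto.get((stack[-1][0], head))
            if target is None:
                return False
            stack.append((target, head))
        else:
            return False
```

Replaying the symbol stack through the new table is what puts the parser in the state it would have reached had it parsed the same prefix with the new grammar from the start.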



3.1 Efficiency of the TLR Parsing Algorithm

The efficiency of the Algorithm in the absence of grammar transformations is es- 
sentially that of LR parsing. The computation of LR parsing tables is expensive, 
but since the tables being generated are not wholly independent of the tables that 
were used up to the point of transformation, an incremental approach is available 
to us. That is, we need calculate only the portion of the table that has changed. 

This idea — incrementally generating parsing tables — was first introduced in 
the context of an interactive parser generator: the language designer would enter 
in productions, or modifications to productions, one at a time. After each produc- 
tion was entered, the parser generator would recalculate the affected portion of the 
parsing tables. As such a system was meant to be interactive, a high premium was 
placed on response time — hence the development of more efficient algorithms for 
computing parsing tables. 

The two options, a full generation or an incremental generation of parsing
tables during a grammar switch, have the same worst-case performance for a
grammar switch operation, because the addition of a single production can cause
an exponential increase in the number of parsing states [17].

The TLR parsing algorithm generates canonical LR(1) parsing tables, which
are more general, but also far larger, than the more common LALR(1) parsing
tables (see [1]). However, LR(1) parsing tables are easy to analyze compared
with LALR(1) parsing tables; hence the trade-off. It would be interesting
to see how the ideas, algorithms and analysis presented in this work could apply to
LALR(1) parsing.

3.2 Correctness 

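Definition 24. Let $G = (\Sigma, N, P, S)$ be an LR(1) grammar, and let $X_i \in (\Sigma \cup N)$ for
$1 \le i \le n$. We define $g : (\Sigma \cup N)^* \to \mathbb{Z}$ as follows:

$$g(\epsilon) = 0, \qquad g(X_1 X_2 \ldots X_n) = \mathrm{goto}[g(X_1 X_2 \ldots X_{n-1}), X_n].$$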
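Reading $g$ as repeated application of goto starting from state 0 (which is how the proof of Theorem 3 uses it), the function can be sketched as below. The dictionary encoding of the goto table and the use of `None` for "undefined" are assumptions made for this illustration.

```python
def g(prefix, goto):
    """Return the state reached from state 0 by following `prefix`, or None.

    `prefix` is a sequence of grammar symbols; `goto` maps (state, symbol)
    to a state. A missing entry means g is undefined for this prefix.
    """
    state = 0
    for symbol in prefix:
        state = goto.get((state, symbol))
        if state is None:
            return None  # g is undefined: no goto entry for this symbol
    return state
```

Under this reading, Theorem 3 says that `g` never returns `None` when `prefix` is a viable prefix of the grammar from which `goto` was built.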



Theorem 3. If $\alpha$ is a viable prefix for $G$, then $g(\alpha)$ is defined. Furthermore, the
items within $I_{g(\alpha)}$ are valid for $\alpha$.

Proof. Let the viable prefix $\alpha$ be given.

If $|\alpha| = 0$, then $g(\alpha) = 0$. Since the item set containing $[S' \to \cdot S, \dashv]$ is always
$I_0$ by our convention, and since $I_0$ is closed, the second
conclusion is true in this case.

Assume now that $g$ is defined for all viable prefixes of length not more than $n$,
for some $n \ge 0$, and assume that $|\alpha| = n + 1$; thus, we write $\alpha = \alpha' X$. Let $x$ be
a terminal string such that $\alpha' X x$ is a sentential form. Consider the derivation of
$\alpha' X x$:

$$S' \Rightarrow^* \beta A_1 z \Rightarrow \beta\gamma X \delta_0 z \Rightarrow^* \beta\gamma X y z = \alpha' X x. \qquad (18)$$

If $\gamma \neq \epsilon$, then it must be true that

$$[A_1 \to \gamma \cdot X\delta_0, u] \in I_{g(\alpha')}$$

by the assumption that $I_{g(\alpha')}$ contains all valid items for the viable prefix $\alpha'$.



What if $\gamma = \epsilon$? Clearly, $\beta = \alpha'$. There is a sequence of $M$ steps in (18) that are
of the form

$$\alpha' A_{m+1} w_{m+1} v \Rightarrow \alpha' A_m \delta_m w_{m+1} v \qquad (19)$$

when going from $S'$ to $\alpha' X y z$, where we have $A_0 = X$ and $w_0 = y$. Between each
step of the form (19), there is a derivation

$$\alpha' A_m \delta_m w_{m+1} v \Rightarrow^* \alpha' A_m w_m v,$$

for an appropriate $w_m \in \Sigma^*$. If we consider the steps prior to the appearance of
$\alpha' A_M w_M z$, we see that

$$S' \Rightarrow^* \zeta C v \Rightarrow \zeta\eta A_M \delta_M v$$

where $\zeta\eta = \alpha'$. Since $M$ is maximal, we see that $\eta \neq \epsilon$. There is thus an item

$$[C \to \eta \cdot A_M \delta_M, u] \in I_{g(\alpha')}.$$

Going through our sequence of productions $A_{m+1} \to A_m \delta_m$ in descending order,
we see that

$$[A_M \to \cdot A_{M-1} \delta_{M-1}, u_{M-1}] \in I_{g(\alpha')};$$

in general

$$[A_{m+1} \to \cdot A_m \delta_m, u_m] \in I_{g(\alpha')}$$

for all $0 \le m < M$ because $I_{g(\alpha')}$ is closed.

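Consider what we have established: in either case, there is an item in $I_{g(\alpha')}$
whose dot immediately precedes $X$, so $\mathrm{goto}[g(\alpha'), X]$ is defined. Thus, since

$$g(\alpha) = g(\alpha' X) = \mathrm{goto}[g(\alpha'), X],$$

the first conclusion of this Theorem is established.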

For the second conclusion of this Theorem, we can use Theorem 5.10 of 
which justifies the construction of the item sets, and in particular, the item set 

It is clear that Step 8 of Algorithm 2 calculates $g$ in a bottom-up fashion; it
will succeed when that function is defined. Therefore, Theorem 3 gives a sufficient
condition for the success of that Algorithm.

Definition 25. Let $G = (\Sigma, N, P, S)$ be an LR(1) grammar. Let $X_i$ be a grammar
symbol for $1 \le i \le n$ such that $X_1 X_2 \ldots X_n$ is a viable prefix for $G$. Define
$\mathcal{P} : (\Sigma \cup N)^* \to (\mathbb{Z} \times (\Sigma \cup N))^*$ as

$$X_1 X_2 \ldots X_n \mapsto ((0, \epsilon), (g(X_1), X_1), (g(X_1 X_2), X_2), \ldots, (g(X_1 X_2 \ldots X_n), X_n)).$$

Definition 26. Let $G = (\Sigma, N, P, S)$ be an LR(1) grammar. If $\alpha = X_1 X_2 \ldots X_n$ is a
sentential form for $G$, then let $1 \le i \le n$ such that $X_i \in N$, and $X_{i+1} X_{i+2} \ldots X_n \in \Sigma^*$.
Let $\beta = X_1 X_2 \ldots X_i$ and let $x = X_{i+1} X_{i+2} \ldots X_n$. Define the B-factorization of $\alpha$ as
$\beta$ and $x$.

Proposition 2. If $G$ is an LR(1) grammar, and $\alpha$ is a sentential form for $G$, then
the B-factorization of $\alpha$ is unique.
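The B-factorization splits a sentential form just after its rightmost nonterminal. A minimal sketch, assuming the form is given as a list of symbol names and the nonterminals as a set (both hypothetical encodings), is:

```python
def b_factorize(form, nonterminals):
    """Split `form` into (beta, x), where beta ends at the last nonterminal.

    Returns None when `form` contains no nonterminal, in which case the
    factorization as defined above does not exist.
    """
    # Scan from the right for the last nonterminal position.
    for i in range(len(form) - 1, -1, -1):
        if form[i] in nonterminals:
            return form[: i + 1], form[i + 1 :]
    return None
```

Because the scan always stops at the rightmost nonterminal, the split point is determined by the form alone, which matches the uniqueness claim of Proposition 2.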

Algorithm 3. We modify Algorithm 1 to include a viable prefix as an input
parameter; this viable prefix will be used to initialize the stack. We do this by letting
$\alpha$ be the new viable prefix parameter, and we replace Step 3 with

3'. Set the stack to $\mathcal{P}(\alpha)$.

Lemma 7. Let $G_0 = (\Sigma, N, P, S, T, M)$. Let $x \in L(G_0)$. Let the semiparse sequence
for $x$ be

$$(x, u_0, G_0) \rightarrowtail (\sigma_1, u_1, G_1) \rightarrowtail \cdots \rightarrowtail (\sigma_n, u_n, G_n).$$

For $1 \le i \le n$, if $\sigma_i$ is B-factored into $\beta$ and $x$, then Algorithm 3 will accept, given
$\beta$, $x$, and $G_i$ as input.

Proof. If n - i = 0, then /? = 5 and x = H. By Lemma^ the prefix parse stack 
contains all valid items for the form S H, which is to say that the item set contains 
[S ' — » S •, h]. Reducing this is an accepting action. 

If n - i > 0, then we proceed by induction on n - If we have that n - i = 1, 
then we let v e S* such thatpx=^y. By Theorem 5.12 of 1 2 1, an LR(1) parser will 

accept y, and at some point during the parsing, the parser will have /3 on its stack, 
and X will be its unshifted input. Since 

(£!'„«,, G,) (S,u„,G„), 

Algorithm|2]will not apply any grammar transformations; instead, it will execute 
the same series of actions that an LR(1) parser would once it reaches the afore- 
mentioned configuration. Hence, Algorithm|3|will accept. 

Assume the result when the input appears as the /''-to-last form in the parse 
sequence, for some j > 0. Assume that n - i = j + 1. By Definition l20l we know 
that there is some a' such that 

ff, t==>a' i=^a,+i. 

G, G, ( 

By Theorem 5. 12 of 1 2 1, the parser will correctly trace o', a', at which point the 

parser will reduce by a transformative production. This will leave the stack string 
as the viable prefix y followed by a; since the transformation which brings G, to 
Gi+i is valid for y, we have by Lemma|6|that y is a viable prefix for G,+i. Thus, 
we can use the induction hypothesis to claim that the parser will accept y. m 

Lemma 8. Let $G = (\Sigma, N, P, S, T, M)$ be a TLR grammar. If $\alpha$ is a viable prefix
followed by $a$, and an LR(1) parser for the LR(1) grammar $(\Sigma, N, P, S)$ would shift
$a$ after $n$ reductions if $\alpha$ is on the parsing stack and $a$ is the lookahead, then a
TLR parser will shift $a$ after $n$ reductions if $\alpha$ is on the parsing stack and $a$ is the
lookahead.

Proof. By induction on $n$. If $n = 0$, then both parsers will immediately shift $a$.

Assume the result for some $k \ge 0$, and assume that $n = k + 1$. Since both
of the parsers will initially have identical stacks, they will reduce by the same
production. If it happens that this production is not transformative, then the parsers
will have identical stacks after the first reduction, after which we can apply the
induction hypothesis. If it happens that this production is transformative, then,
letting the new grammar be $\Delta G$, we can apply the induction hypothesis by virtue
of Theorem 2. ■

Theorem 4. Let $G = (\Sigma, N, P, S, T, M)$ be a TLR grammar. The TLR Parsing
Algorithm (Algorithm 1) recognizes $L(G)$.

Proof. Let $x \in \Sigma^*$.

If $x \in L(G)$, then, because Algorithm 3 operates as Algorithm 1 does when
$\alpha = \epsilon$, we know that the parser will accept $x$, given Lemma 7.

Assume, then, that $x \notin L(G)$. There are several ways in which this could
happen. First, the $\Delta$-machine could emit a transformation that is not valid; if this
happens, then the parser will clearly reject $x$. Second, after a shift or a reduction
by a production that is not transformative, it could be that the stack string is not
a viable prefix, or the lookahead might not follow the stack string; in either case,
by Theorem 5.12 of [2], the parsing tables will call for an error action, and so the
parser will reject the string.

The only other possibility is that there is a sequence of tuples

$$(x, u_0, G) = (\sigma_0, u_0, G_0) \rightarrowtail (\sigma_1, u_1, G_1) \rightarrowtail (\sigma_2, u_2, G_2) \rightarrowtail \cdots$$

with no upper bound on the length of this sequence. We shall dispose of this
possibility presently. Assume that such a sequence exists. Choose some $i \ge 0$,
and let $\sigma_i = \beta B y$, where $\beta \in (\Sigma \cup N)^*$, $B \in N$, and $y \in \Sigma^*$. Note that $\beta B = \gamma$ is
a viable prefix. We proceed by induction on $|y|$. If $|y| = 1$, then we note that an
LR(1) parser would shift the first symbol of $y$ (specifically: $\dashv$) after $m$ reductions.
By Lemma 8, the TLR parser will shift the first symbol of $y$ after $m$ reductions.
Assume that the parse always terminates for strings of length $k \ge 1$, and assume
that $|y| = k + 1$. The parser will, in light of Lemma 8, shift the first symbol of $y$
after a finite number of reductions. At this point, we have a viable prefix, followed
by a $k$-character string, allowing us to apply the induction hypothesis. Therefore,
the parse always completes.

Since we have exhausted the possible reasons why $x \notin L(G)$, we conclude that
the parser recognizes $L(G)$. ■

4 Checking the Validity of a Transformation

In the previous section, we considered the set of valid transformations for a given 
viable prefix and lookahead symbol. In this section, we develop an algorithm to 
determine if a particular transformation is valid, and we provide a correctness proof 
of the same. 

4.1 Computing the Conservation Function 

We have discussed the criteria for membership of a transformation in the set of
valid transformations; these criteria must be met by transformations emitted by the
$\Delta$-machine. It is not immediately clear how we are to determine whether or not a
transformation is in this set. We consider a method of making this determination
presently.

Algorithm 4 (An Algorithm to Compute the Conservation Function). This
algorithm computes a conservation function much like $V_p(\pi)$. It is straightforward
to test the transformation for validity, given this function. The construction of the
set is accomplished by tracing all of the different ways we might decide that the
lookahead gets parsed from the start symbol. As we trace through the different
productions in the grammar, we record our progress in sets of ordered pairs. The
inclusion of an ordered pair like $(\pi, k)$ in one of these sets, labeled $V_{\text{something}}$, means
that one of the procedures invoked during the execution of this Algorithm visited
the first $k$ symbols of $\pi$; fortuitously, this turns out to be exactly what we need to
generate the conservation function.

Input $G = (\Sigma, N, P, S, T, M)$: a TLR grammar; $\sigma = ((s_0, \epsilon), (s_1, X_1), \ldots, (s_m, X_m))$:
a parse stack; $a$: a terminal called the "lookahead"

Output $V_p$: a set of ordered pairs $(C \to \delta, i)$, where $C \to \delta \in P$, and $i \le |\delta|$

Method 

1. Calculate the item sets for $G$; let them be $I_0, I_1, \ldots, I_p$.

2. Let $V_t$ be an empty set of ordered pairs of the same type as $V_p$.

3. Let $(s, B)$ be the item on the top of the stack.

4. For every item of the form $j = [A \to \alpha B \cdot \gamma, b]$ in $I_s$, do the following:

(a) Call Procedure 5 with $G$, the stack, $j$, and $a$ as input; let $V_p$ and $f$ be its
output.

(b) Set $V_t = V_t \cup V_p$.

(c) If $f$ is true, then call Procedure 6 with $G$, the stack, $j$, and $a$ as input;
let $V_a$ be its output, and set $V_t = V_t \cup V_a$.

5. For every production $\pi \in P$, define $V(\pi)$ as follows:

$$V(\pi) = \begin{cases} \max\{i \in \mathbb{Z} : (\pi, i) \in V_t\} & \text{if there exists some such } i \\ -1 & \text{otherwise} \end{cases}$$

6. Let $V_p = \{(\pi, V(\pi)) : \pi \in P\}$. Return $V_p$.

Procedure 5. This procedure starts from an item in the current item set and
searches for all of the ways that the lookahead could be included by that item,
if we assume that the production in the given item must eventually be reduced. It
does this by considering $\gamma$, the "tail" of the item in question. Each symbol of $\gamma$ is
considered, continuing as long as $\epsilon$ can be derived from the current symbol, until $\epsilon$
cannot be derived. If it turns out that $\gamma \Rightarrow^* \epsilon$, then we back up in the symbol stack
to where the parser first started to consider the current item, and we recursively
retry this Procedure from that point. Upon halting, we return $V_p$ and $f$. The set of
ordered pairs $V_p$ records which productions we have visited during the execution
of this Procedure, or one of the procedures invoked during its execution. The flag
$f$ indicates whether we found any way of deriving the lookahead.

Input $G = (\Sigma, N, P, S, T, M)$: a TLR grammar; $\sigma = ((s_0, \epsilon), \ldots, (s_m, X_m))$:
a parse stack; $a$: a terminal; $j = [A \to \alpha B \cdot \gamma, b]$: an item

Output $V_p$: a set of ordered pairs $(C \to \delta, i)$, where $C \to \delta \in P$, and $i \le |\delta|$; $f$: a
boolean flag



Method 



1. Calculate the item sets for $G$; let them be $I_0, I_1, \ldots, I_p$.

2. Set $f$ to false.

3. Let $V_z$ be an empty set of ordered pairs, of the same type as $V_p$.

4. Let $\gamma = Y_1 Y_2 \ldots Y_n$.

5. Let $\pi$ be the production $A \to \alpha B \gamma$.

6. Let $i$ range from 1 to $n$, and do the following:

(a) If $Y_i = a$, then add $(\pi, |\alpha| + 1 + i)$ and the contents of $V_z$ to $V_p$.

(b) If $Y_i$ is a terminal, then return $V_p$.

(c) If $Y_i$ is a nonterminal, then call Procedure 7 with $G$, $Y_i$, $a$, and $\emptyset$ as the
input; let the output be $V_p$, $\Pi$, $f_n$, and $e$ (we ignore $\Pi$).

(d) If $e$ is true or $f_n$ is true, then set $V_z = V_z \cup V_p$.

(e) If $f_n$ is true, then add $(\pi, |\alpha| + 1 + i)$ and the contents of $V_z$ to $V_p$, set
$V_z = \emptyset$, and set $f$ to true.

(f) If $e$ is false, then return $V_p$ and $f$.

7. Add $(\pi, |\alpha B \gamma| + 1)$ to $V_z$.

8. Pop $|\alpha| + 1$ items off of the stack. Let $s$ be the state in the top item of the
stack.

9. Set $J = \{j_0\}$, where $j_0$ is the item $[A \to \cdot \alpha B \gamma, b]$ that is in $I_s$.

10. Repeat the following until no more items can be added to $J$.

(a) If there is an item of the form $[C \to \cdot \delta, c]$ in $J$ and an item of the form
$[D \to \zeta \cdot C \eta, d]$ in $I_s$, then add the latter item to $J$, if it is not already in
$J$.

11. For each item $[E \to \theta \cdot \kappa, e]$ in $J$, do the following:

(a) Let $\phi$ be the production $E \to \theta\kappa$.

(b) Add the ordered pair $(\phi, |\theta| + 1)$ to $V_z$.

(c) Call this Procedure recursively with $G$, $a$, and the current value of the
stack as input, along with $[E \to \theta \cdot \kappa, e]$ in place of $j$; let $V_p$ and $f_p$ be
the output.

(d) If $f_p$ is true, then set $f$ to true, set $V_z = V_z \cup V_p$, and set $V_z = \emptyset$.

12. Return $V_p$ and $f$. ■

Procedure 6. This procedure takes an item $j$ in the current item set, and finds all
of the items in one of the preceding item sets that might be reduced, if we assume
that $j$ must be reduced. These "ancestor" items of our item $j$ do not need to be
totally conserved: they only need those symbols to the left of and immediately to
the right of the dot to be conserved. Once this procedure has popped some item
sets off of the stack, then it calls itself recursively.

Input $G = (\Sigma, N, P, S, T, M)$: a TLR grammar; $\sigma = ((s_0, \epsilon), \ldots, (s_m, X_m))$:
a parse stack; $a$: a terminal called the "lookahead"; $j = [A \to \alpha B \cdot \gamma, b]$: an
item

Output $V_a$: a set of ordered pairs $(C \to \delta, i)$, where $C \to \delta \in P$, and $i \le |\delta|$

Method

1. Calculate the item sets for $G$; let them be $I_0, I_1, \ldots, I_p$.

2. Pop $|\alpha| + 1$ items off of the stack. Let $s$ be the state in the element on the top
of the symbol stack.

3. Let $J = \{j\}$.

4. Repeat the following until no more items can be added to $J$.

(a) If there is an item of the form $[C \to \cdot \delta, c]$ in $J$ and an item of the form
$[D \to \cdot C \zeta, d]$ in $I_s$, then add the latter item to $J$, if it is not already in $J$.

5. For every item $k = [E \to \eta \cdot \theta, d]$ in $J$, do the following:

(a) Let $\pi$ be the production $E \to \eta\theta$.

(b) Add $(\pi, |\eta| + 1)$ to $V_a$.

(c) If $\eta \neq \epsilon$, then call this Procedure recursively with $G$ and $a$, the current
values of the stacks as input, along with $k$ in place of $j$; let the output
be $V_a'$.

(d) Set $V_a = V_a \cup V_a'$.

6. Return $V_a$. ■

Procedure 7. Determine all of the ways that a given grammar symbol can derive
the lookahead. If the grammar symbol is a terminal, then do nothing. If the grammar
symbol is a nonterminal, then return $V_s$, a set representing the portions of the
productions that we visited during the execution of this and the following Procedure.
This procedure calls itself recursively, so care must be taken
to avoid an infinite loop; we use $\Pi$, a set of the productions that this Procedure has
already visited, that is both input to and output from this Procedure. We also return
two flags: the flag $e$ is true if and only if $X \Rightarrow^* \epsilon$; the flag $f$ is true if and only if
$X \Rightarrow^* ax$, for some $x \in \Sigma^*$.

Input $G = (\Sigma, N, P, S, T, M)$: a TLR grammar; $X$: a symbol in $\Sigma \cup N$; $\Pi$: a set of
productions; $a$: a terminal

Output $V_s$: a set of ordered pairs $(\pi, n)$; $\Pi$: a set of productions; $e$: a boolean flag;
$f$: a boolean flag

Method

1. Set both $e$ and $f$ to false.

2. If $X$ is a terminal, then do the following:

(a) Set $V_s = \emptyset$.

(b) If $X = a$, then set $f$ to true.

(c) Return $V_s$, $\Pi$, $f$ and $e$.

3. For every production $X \to \sigma$, such that $X \to \sigma \notin \Pi$, do the following:

(a) Add $X \to \sigma$ to $\Pi$.

(b) Execute Procedure 8 with $G$, $\sigma$, $\Pi$, and $a$ as input, and let $V_\gamma$, $\Pi_\gamma$, $f_\gamma$, and
$k$ be the output.

(c) If $k = |\sigma| + 1$, then set $e$ to true.

(d) Set $\Pi = \Pi \cup \Pi_\gamma$.

(e) If $f_\gamma$ is true, then set $f$ to true.

(f) If $f_\gamma$ is true or $e$ is true, then set $V_s = V_s \cup V_\gamma$, and put $(\pi, k)$ in $V_s$.

4. Return $V_s$, $\Pi$, $f$ and $e$. ■

Procedure 8. Determine all of the ways that a given string of grammar symbols
can derive a string beginning with the lookahead. Return this information in $V_\gamma$, a
set representing the portions of the productions that we visited during the execution
of this and the preceding Procedure. The set $\Pi$ is used for the same purpose as in
Procedure 7. The integer $k$ encodes the result of this Procedure's execution as
follows: if $\gamma \Rightarrow^* \epsilon$, then we return $k = |\gamma| + 1$; otherwise, if $\gamma \Rightarrow^* ax$, for some
$x \in \Sigma^*$, then we let $1 \le k \le |\gamma|$; otherwise, we let $k = -1$. If we have both that
$\gamma \Rightarrow^* \epsilon$ and that $\gamma \Rightarrow^* ay$, for some $y \in \Sigma^*$, then we let $k = |\gamma| + 1$ and we let $f$ be
true.

Input $G = (\Sigma, N, P, S, T, M)$: a TLR grammar; $\gamma$: a string over $(\Sigma \cup N)^*$; $\Pi$: a set
of productions; $a$: a terminal

Output $V_\gamma$: a set of ordered pairs $(\pi, n)$; $\Pi$: a set of productions; $f$: a boolean
flag; $k$: an integer with $-1 \le k \le |\gamma| + 1$

Method

1. Set $k = -1$.

2. Set $f$ to false.

3. Let $\gamma = X_1 X_2 \ldots X_n$.

4. For each $i$ from 1 to $n$, do the following:

(a) Execute Procedure 7 with $G$, $X_i$, and $\Pi$ as input and $V_s$, $\Pi_s$, $e$, and $f_s$
as output.

(b) Set $\Pi = \Pi \cup \Pi_s$.

(c) If $f_s$ is true or $e$ is true, then set $k = i$.

(d) If $f_s$ is true, then set $f$ to true.

(e) Set $V_\gamma = V_\gamma \cup V_s$.

(f) If $e$ is false, then go to Step 6.

5. Return $V_\gamma$, $\Pi$, $f$ and $n + 1$.

6. If $f$ is false, then set $V_\gamma = \emptyset$ and set $k = -1$.

7. Return $V_\gamma$, $\Pi$, $f$ and $k$. ■
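The two flags returned by Procedures 7 and 8 (whether a symbol or string can derive $\epsilon$, and whether it can derive a string beginning with the lookahead) can also be obtained from the standard nullable and FIRST-set fixpoint. The sketch below uses that textbook technique in place of the recursive procedures above, and it does not record the visited portions of productions. The grammar encoding (a list of `(lhs, rhs)` pairs with `rhs` a tuple of symbol names) is an assumption made for the example.

```python
def nullable_and_first(productions):
    """Compute the nullable nonterminals and FIRST sets by fixpoint iteration.

    `productions` is a list of (lhs, rhs) pairs, rhs being a tuple of symbols.
    Any symbol that never appears as an lhs is treated as a terminal.
    """
    nonterminals = {lhs for lhs, _ in productions}
    nullable = set()
    first = {n: set() for n in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            all_nullable = True
            for sym in rhs:
                new = first[sym] if sym in nonterminals else {sym}
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
                if sym not in nullable:
                    all_nullable = False
                    break
            if all_nullable and lhs not in nullable:
                nullable.add(lhs)
                changed = True
    return nullable, first


def derives(gamma, a, nullable, first):
    """Return (f, e): f if gamma derives a string starting with `a`,
    e if gamma derives the empty string."""
    f = False
    for sym in gamma:
        if sym == a or a in first.get(sym, ()):
            f = True
        if sym not in nullable:
            return f, False
    return f, True
```

For the grammar `S -> A b`, `A -> ε | a`, calling `derives(("A", "b"), "b", ...)` reports that the string can start with `b` (because `A` is nullable) but cannot vanish entirely.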

In Procedure 5, we call Procedure 7 in Step 6c. Since calling Procedure 7 twice
with the same nonterminal does not yield any new information, as an optimization,
we could keep track of the nonterminals that have already been passed to
Procedure 7, calling that Procedure only if we have not called it with that nonterminal
before. As an additional optimization, we could retain $\Pi$ in the same step, and not
pass $\emptyset$ to Procedure 7. We leave the algorithm as it is because it makes it a little
easier to analyze, a task which we turn to now.
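The last step of Algorithm 4 collapses the accumulated set of visited pairs into a function on productions: the largest recorded index for each production, or $-1$ when the production was never visited. A minimal sketch, assuming productions are represented by hashable labels and the visited set by an iterable of `(production, index)` pairs (both hypothetical encodings), is:

```python
def conservation_function(visited, productions):
    """Map each production to the largest index recorded for it, else -1.

    `visited` is an iterable of (production, index) pairs; `productions`
    is an iterable of every production in the grammar.
    """
    best = {}
    for production, index in visited:
        if production not in best or index > best[production]:
            best[production] = index
    # Productions that were never visited get the sentinel value -1.
    return {production: best.get(production, -1) for production in productions}
```

For example, with the visited pairs `("S->AB", 1)`, `("S->AB", 3)` and `("A->a", 2)`, the production `S->AB` maps to 3, `A->a` maps to 2, and an unvisited `B->b` maps to -1.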

4.1.1 A Model of the Operation of the Algorithm 

We will endeavor to prove that the Algorithm is correct. This will be exceedingly
dull and difficult if we attempt to do so directly. Rather, we will model the
operation of the algorithm in simple, formal terms in this section. With this model
in hand, it will be possible to produce the desired proof. Note that we do not
formally assert the equivalence of the Algorithm presented in this section and the
model presented here: we will, however, take the "model" to be authoritative.

Definition 27. Let the transformative context-free grammar $G = (\Sigma, N, P, S, T, M)$
be given such that the context-free grammar $(\Sigma, N, P, S)$ is LR($k$). Let the item set
$I$ be given. If $i = [A \to \cdot \beta, y]$ and $j = [C \to \gamma \cdot A\delta, z]$ are two items in $I$, then we
write $j \triangleright_k i$.
Definition 28. Let the transformative context-free grammar $G = (\Sigma, N, P, S, T, M)$
be given, such that the context-free grammar $(\Sigma, N, P, S)$ is LR($k$). Let the
collection of item sets for $G$ be $I_0, I_1, \ldots, I_n$. Finally, let the parser stack

$$((s_0, \epsilon), (s_1, X_1), \ldots, (s_m, X_m))$$

be given. Let $i$ and $j$ be two items, and let $1 \le p \le m$ such that $i \in I_{s_p}$. If either:

1. $j \triangleright_k i$; or

2. $j$ is in $I_{s_{p-1}}$ and it is of the form $[A \to \alpha \cdot X_p \beta, z]$, such that $i$ is of the form
$[A \to \alpha X_p \cdot \beta, z]$,

then we write $j \blacktriangleright_k i$.

Definition 29. Let the transformative context-free grammar G = (S, N, P, 5, T, M) 
be given, such that the context-free grammar (S, N, P, 5) is LR(A:). Let the collec- 
tion of item sets for G be /q, /i, . . . , /„. Finally, let the parser stack 

((5o,e),(*i,Xi),...,(5„,XJ) 

be given. Let ji, ji, ■ . ■ , jq be a sequence of items, and letpi,p2, ■ • ■ ,Pi, £ |0, 1, . . . , n), 
such that, for 1 < ; < ^, we have that p,+i -p, is or 1. If we have all of the fol- 
lowing, then we say that (71, j2, .... jq\ pi,P2, • • • ,Pq) is a fe-parse precession: 

1. ig € /j, where * = ip,; 

2. for 1 < r < q, 

ir*' jr+U 

k 

3. ji is of the form [C — > 7, j:], where |x| = k; and 

4. there do not exist indices p and q such that 7'^ = 7, and p^ = p,, and for all 
p < h < q ws have j';, > 71,+ 1 and for some p < r < q, we have that either 

k 

jp = jr or that jq = jr. 

We often omit the second component of a /:-parse precession — namely, the 
sequence of indices pi,p2, . . . ,Pq — ^unless they are explicitly called for. 

Definition 30. Let G = (X, N, P, 5, T, M) be a TLR grammar, let /3B = X1X2 . . . X,, 
be a viable prefix followed by a, and let {{so,e),{si,Xi), . . . ,{Sp, Xp)) be a parse 

stack. Let J = (Ji,j2 jn',Pi,P2,---,Pn) be a 0-parse precession, let U = 

(ui,U2, . . . ,u„;ti,T2, . . . ,T„) be a 1-parse precession, and let i = [A —> 6-y] 
be an LR(0) item. Let m, = [C; — » • a], for 1 < i < m. Consider the following 
criteria: 

1. j, = [y ^ 

2. either of the following: 

(a) m = 0, in which case, * is of the form [A — » S'B ■ y] where S'B = S, and 
we have that j„ ► s and that p„ = p - I, or 

(b) m > 0, and all of the following: 

• i is of the form [A — > d'C ■ y] for 6'C = S, 

• Urn is of the form [D —> OB ■ k, a], 

• Tm= P\ 

3. y ax, for some x £ E* ; 
4. all 77/, => e, for A > 1.

If all of these criteria hold, then we say that U, J, and s constitute an upward link 
to a, an ancestral link, and a sidelink to a, respectively for [}B, and we say that U 
and J join i. 

Definition 31. Let pB be a viable prefix. Let U = (ui,U2,..., u„) be a 1-parse 
precession. Let P = {pi,p2, ■ ■ ■ , p,,,] be the set of indices such that pi < pi < ■ ■ ■ < 
Pi„ and for 1 < /i < m, we have that is of the form [A — > c ■ 7] where ate, and 
we also have that, for any index hg i P, we have that 7;,^ is of the form [A — » -y]. 
We let = [A/, ai,Xi, ■ 7/,] for 1 < /? < n^, where X;, is a grammar symbol. Let 
6 = X[X2 . ■ . X,,,; we call 6 the trace of U . 

Definition 32. Let $\beta B$ be a viable prefix followed by $a$. Let $U$, $J$ be an upward
and an ancestral link, joining the sidelink $s$. Let $\delta_J$ be the trace of $J$. There are two
cases that we will consider:

1. if $|U| = 0$, then we know that $s$ is of the form $[A \to \alpha B \cdot \gamma]$. Let $\delta = \delta_J B$;
otherwise,

2. if $|U| > 0$, then we let $\delta = \delta_J \delta_U$, where $\delta_U$ is the trace of $U$.

We call $\delta$ the trace of the tuple $(U, J, s)$.

Theorem 5. Let $\beta B$ be a viable prefix followed by $a$. Let $U$, $J$ be an upward and
an ancestral link, joining the sidelink $s$. The trace of $(U, J, s)$ is $\beta B$.

Proof. Let a = (sq, si,S2, . . . , s„) be the state stack, and let /q, /i, /2, . . . , /,„ be the 
item sets for G. Let U = (ui,U2, ■ ■ ■ , m,,; ti, T2, . . . , Tp) be a 1-parse precession, and 
let J = {ji, ji, ■ ■ ■ , jq',pi,P2, ■ ■ ■ ,pq) be a 0-parse precession. Let/? = X1X2 . . .X„. 
We will first define the LR(0) item r. 

1. if\U\ =0, then let r= s; 

2. if |t/| > 0, then let ui = [Au — » ffu • Ju, o] and let r = [An — > au ■ 7^]. 

We begin by proving that the trace of 7 is a prefix of pB that is of length p,. 
We shall proceed by induction on p,. Assume that p^j = 0. For all j e J, we know 
that j is of the form [Aj — > -cfj]. Thus, the trace of J is e. 

Assume that we know that the trace of 7 is a prefix of pB that is p, symbols 
long when |/| = nj, where n^ > 0. Let us assume that p, = nj + 1 symbols long. Let 
qg be the greatest index such that p,^ = nj. We know that p,„ = nj + 1. Thus, 

jq,*- jq.+i but 

Since j^^+i is of the form [Ag — » OgXg -yg], we must have, by Condition |2| of 
Definition|5Hl that 

Since we know, by the induction hypothesis, that the trace of J' = (juji, ■ ■ ■, jgg) 
is 

X[X2 ■ ■ ■X„., 

we conclude that the trace of J is 

X1X2 ■ . -Xij.+i. 

We now consider the possibility that |t/| = 0. In this case, we have — by Con- 
dition|2a|of Definition BOl — that p, = n - 1 . Also, we know that s is of the form 

[A, ^S,B-y,]. 
Thus, the trace of $J$ is $\beta$, so by Condition 1 of Definition 32 we see that the trace
of $(U, J, s)$ is $\beta B$.

If |(7| > 0, then note that 

Ti + 1. 

Let t = Tp- Pg. We shall proceed by induction on t. 

Assume that t = I. Now, as we know that t;, = pg + I for all I < h < p, we 
conclude that ui is of the form 

[C„ ^ ■ri^,a] 
and that when I < h < p,we have that m;, is of the form 

[C;, -0^,0]. 
However, we know that Up is of the form 

[C. •;/.,«]. 

Therefore, we note that \ U\ = 1, and by Condi tion l2bl of Definition l30l we note that 
^(jq) = n - I. Thus, the trace of U is B, and since the trace of J is /3, we conclude 
that the trace of ((/, J, s) is j3B. 

Assume now that the trace of |(/| is known to be 

Xn-t+2X„-r+3 ■ ■ - XnB 

when t = k, fo'c k > 1. Assume that t = k + \. Consider u^: let u\ = [A„j — > 
• Juj , a]; now let Uf = [A„, — > a,,, • 7,,,]. We know that 

jp ► Mf but ^ Uf-, 

thus, = s(ui) - 1. Since 7,, ^ Mf, we conclude that ui is of the form [A„, — > 
a'^jX^j ■ y^ij , a], where aj,!^,,, = ff,,, • As mi is a live item in /^^ , we conclude that 
Z„j = X„_,+2- Therefore, by induction, the trace of U is 

■^n-r+2'^n-r+3 • • • X„B. 

We now know that the trace of 7 is ySj = X1X2 ■ ■ ■ Xp^^ . Also, when we assume 
that \U\ > 0, we also know that the trace of U is Pu = Xt^Xt, . . .X^^^ ^B. Since 
Ti = + 1, and since r,, = /?, we have therefore established \ha\.pj[}u = /3B. m 

Definition 33. Let $G = (\Sigma, N, P, S, T, M)$ be a TLR grammar. Let $a$ and $\gamma$ be given,
such that $\gamma \Rightarrow^* ax$, where $x \in \Sigma^*$. Let $\gamma = X_1 X_2 \ldots X_n$. If $k$ is an index such that

$$X_1 X_2 \ldots X_{k-1} \Rightarrow^* \epsilon \qquad (20)$$

and

$$X_k \Rightarrow^* ay,$$

where $y \in \Sigma^*$, then $(k, \gamma)$ comprises a partial downward link to $a$ for $\gamma$. Of course,
if $k = 1$, then we take the symbol $X_1 X_2 \ldots X_{k-1}$ to be a synonym for $\epsilon$, in which
case (20) is trivial.
Definition 34. Let $G = (\Sigma, N, P, S, T, M)$ be a TLR grammar. Let $a$ and $\gamma$ be given,
such that $\gamma \Rightarrow^* ax$, where $x \in \Sigma^*$. Let $\gamma = X_1 X_2 \ldots X_n$. If $k$ is an index such that

$$X_1 X_2 \ldots X_{k-1} \Rightarrow^* \epsilon$$

and $X_k = a$, then we say that $(k, \gamma)$ is a terminal link to $a$ for $\gamma$.
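The index condition for a terminal link can be checked with a single left-to-right scan: position $k$ qualifies when the symbol there is the lookahead and every symbol before it can derive $\epsilon$. The sketch below assumes the set of nullable nonterminals has already been computed; that set, and the tuple encoding of $\gamma$, are assumptions made for this illustration.

```python
def terminal_links(gamma, a, nullable):
    """Return every 1-based index k such that gamma[:k-1] derives the empty
    string and the k-th symbol of gamma is the terminal `a`."""
    links = []
    for k, symbol in enumerate(gamma, start=1):
        if symbol == a:
            links.append(k)
        if symbol not in nullable:
            break  # the prefix can no longer vanish, so stop scanning
    return links
```

Since a terminal is never nullable, the scan stops at the first match, so at most one index is returned for a given $\gamma$ and $a$.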

Definition 35. Let $L$ be a partial downward link to $a$ for $\gamma$, with value $k$. Let
$\gamma = X_1 X_2 \ldots X_n$. If $L_d$ is a partial downward link to $a$ for $\gamma_d$, then $L_d$ is chained to
$L$ if $X_k \to \gamma_d$ is a production. If $L_t$ is a terminal link to $a$ for $\gamma_t$, then $L_t$ is chained
to $L$ if $X_k \to \gamma_t$ is a production. If $L_1$ and $L_2$ are two links, then we define the chain
production to be either $X_k \to \gamma_d$ or $X_k \to \gamma_t$, as appropriate, and we represent this
production with the symbol $P(L_1, L_2)$.

Definition 36. Let $L_1, L_2, \ldots, L_n$ be a sequence of chained links, where each
$L_i = (k_i, \gamma_i)$ is a partial downward link for $\gamma_i$ to $a$ when $i < n$,
while $(k_i, \gamma_i)$ is a terminal link for $\gamma_i$ to $a$ when $i = n$. If, for $1 \le j < n$, there is
at most one other index $1 \le k < n$ such that $j \neq k$ but $P(L_j, L_{j+1}) = P(L_k, L_{k+1})$, then
we say that this sequence of strings, partial downward links and this terminal link
comprises a complete downward link from $\gamma_1$ to $a$.

Definition 37. Let $G = (\Sigma, N, P, S, T, M)$ be a TLR grammar, and let $\beta$ be a viable
prefix followed by $a$. Let $U$, $J$ be an upward and an ancestral link, joining the
sidelink $s$. Letting $s = [A \to \delta \cdot \gamma]$, let the complete downward link $D$ from $\gamma$ to $a$
be given. Let $U = (u_1, u_2, \ldots, u_n; \tau_1, \tau_2, \ldots, \tau_n)$ such that, for $1 \le i \le n$, we have
$u_i$ of the form

$$u_i = [C_i \to \alpha_i \cdot \delta_i, a].$$

If we have that $\delta_i \Rightarrow^* \epsilon$ for $1 < i \le n$, then we call the ordered quadruple
$(U, J, s, D)$ a parse path for the viable prefix $\beta$ followed by $a$.

We will argue that Algorithm 4 operates by enumerating all parse paths for the
viable prefix $\beta$ followed by $a$.

Let us say that we have a TLR grammar $G = (\Sigma, N, P, S, T, M)$, and that we are
parsing a sentence $x$. The parser has just reduced by a transformative production,
leaving the stack as $\beta B$, with lookahead $a$. Algorithm 4 begins with all of the items
in the item set on the top of the stack that are of the form $[A \to \beta B \cdot \gamma, b]$, which
is incidentally the form of the sidelink. The Algorithm's next step is to invoke
Procedure 5 to find the downlinks.

Procedure 5 scans the "remainder" of the current item (that is, the portion
to the right of the dot) to determine if this remainder can be used to derive $a$ or
$\epsilon$. The way that it makes this determination is with Procedures 7 and 8. If these
Procedures successfully find such a derivation, then they return the portions of
each production that they used in $V_s$ and $V_\gamma$; if they are not successful, then those
two sets are empty. If these Procedures determine that the remainder derives $\epsilon$,
then Procedure 5 will find all items that precede the current item in the parse
precession. The way that these items are found is by first "rewinding" the parse
stack until such a time as the parse first started to consider the current item. At
this point, we create $J$, which is like the closure of an item set, taken in reverse.
We consider the items in $J$ one at a time. For a parse precession, we do not allow
an item to be present more than twice, unless we go to the previous item set; in
Procedure 5, we create $J$ first, then we call the Procedure recursively for each item
in the set. Since, in Step 8 of the Procedure, we pop at least one item off of the
stack, and since we call the Procedure recursively exactly once for each item in $J$,
we can be sure that the sequence of items we trace does not violate the repetition
condition in the definition of a parse precession.
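The "closure taken in reverse" that builds $J$ can be sketched as a worklist computation: starting from a seed item, repeatedly add any item of the same item set whose dot sits immediately before the left-hand side of an item already collected with its dot at the far left. In this illustration items are encoded as `(lhs, rhs, dot)` triples and lookaheads are omitted; both simplifications are assumptions, not the paper's representation.

```python
def reverse_closure(seed, item_set):
    """Collect the items of `item_set` from which `seed` could have been
    introduced by ordinary closure, together with `seed` itself.

    An item is a triple (lhs, rhs, dot), where rhs is a tuple of symbols and
    dot is the position of the dot within rhs.
    """
    closure = {seed}
    worklist = [seed]
    while worklist:
        lhs, rhs, dot = worklist.pop()
        if dot != 0:
            continue  # only items with the dot at the far left have parents here
        for parent in item_set:
            p_lhs, p_rhs, p_dot = parent
            expects_lhs = p_dot < len(p_rhs) and p_rhs[p_dot] == lhs
            if expects_lhs and parent not in closure:
                closure.add(parent)
                worklist.append(parent)
    return closure
```

Each item enters the worklist at most once, so the loop terminates after at most one pass per item of the item set.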

We do not allow a production to appear more than twice in the complete downward
link to $a$, so we use the production set $\Pi$, which we initialize to $\emptyset$ when we
invoke Procedure 7 in Procedure 5.

During this process, each of the upward links to a are enumerated, as are the 
complete downward links to a, with the appropriate sidelink for one of the parse 
precessions. 

As for the ancestral links, we have Procedure 6, which is invoked only if
Procedure 5 succeeds in finding a way to derive $a$ from an item.

What of $V_p$, the set returned by Algorithm 4? We have gone to some effort to
construct a model of the operation of Algorithm 4, but we have no counterpart for
the set $V_p$. We now construct a function which, given a parse path and a production
$\pi$, returns the value of $V_p(\pi)$ that the Algorithm would produce as it traces out that
parse path.

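Definition 38. Let P = (U, J, s, D) be a parse path for the viable prefix β followed
by a. Letting U = (u_1, u_2, …, u_n), we let

\[ u_k = [A_k \to \alpha_k \cdot \gamma_k,\ a] \quad\text{and}\quad \pi_k = A_k \to \alpha_k\gamma_k \]

for 1 ≤ k ≤ n. Letting J = (j_1, …, j_m), we let:

\[ j_h = [C_h \to \zeta_h \cdot D_h\eta_h] \quad\text{and}\quad \varphi_h = C_h \to \zeta_h D_h \eta_h \]

for 1 ≤ h ≤ m. Letting D = (L_1, L_2, …, L_p), we let:

\[ L_i = (\theta_i, k_i) \]

for 1 ≤ i ≤ p. We first define six functions.

1. Define V_Π : P → ℤ as

\[ V_\Pi(\pi) = \begin{cases} |\alpha_k| + |\gamma_k| + 1 & \pi = \pi_k \text{ for some } k \\ -1 & \text{otherwise} \end{cases} \]

2. Define V_Γ : P → ℤ as

\[ V_\Gamma(\pi) = \begin{cases} |\delta| + 1 & \pi = A \to \delta \text{ is used in the derivation } \gamma_k \Rightarrow^{*} \epsilon \text{ for some } k \\ -1 & \text{otherwise} \end{cases} \]

3. Define V_Φ : P → ℤ as

\[ V_\Phi(\pi) = \begin{cases} |\zeta_h| + 1 & \pi = \varphi_h \text{ for some } h \\ -1 & \text{otherwise} \end{cases} \]

4. Define V_Θ : P → ℤ as

\[ V_\Theta(\pi) = \begin{cases} k_{h+1} & \pi = P(L_h, L_{h+1}) \text{ for some } h \\ -1 & \text{otherwise} \end{cases} \]

5. Letting θ_h = X_{h,1}X_{h,2}…X_{h,l_h}, define V_{Θ,ε} : P → ℤ as

\[ V_{\Theta,\epsilon}(\pi) = \begin{cases} |\kappa| + 1 & \pi = E \to \kappa \text{ is used in the derivation } X_{h,j} \Rightarrow^{*} \epsilon \text{ for some } h \\ -1 & \text{otherwise} \end{cases} \]

6. Letting s = [D_s → γ_s · ζ_s], define V_Ω : P → ℤ as

\[ V_\Omega(\pi) = \begin{cases} |\gamma_s| + |\theta_1| + 1 & \pi = D_s \to \gamma_s\zeta_s \\ -1 & \text{otherwise} \end{cases} \]

We have now come to our goal: define L_{V,P} : P → ℤ as

\[ L_{V,P}(\pi) = \max\{V_\Pi(\pi),\ V_\Gamma(\pi),\ V_\Phi(\pi),\ V_\Theta(\pi),\ V_{\Theta,\epsilon}(\pi),\ V_\Omega(\pi)\}. \]

We call L_{V,P} the parse path conservation function.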

We intend Vp to correspond exactly to the parse path conservation function, and
vice versa. We can justifiably use the output of the Algorithm to determine the
validity of a transformation if we can justifiably use the parse path conservation
function for that task. We will first recast the validity test for transformations
in the next section, after which we will provide the promised justification.

4.2 An Alternative Test for Allowable Transformations 

The conservation function of Section 2.4.2 may consider an infinite number of
parse trees; thus, it is not self-evident that any analysis of parse paths will be able
to reproduce the conservation function, unless an infinite number are considered.
In this section, we consider a subset of the parse trees considered by the
conservation function which is finite in number and which does reproduce the
conservation function. Moreover, this subset of parse trees will be "isomorphic," in
a sense, to the set of parse paths. After constructing this set of parse trees, and
establishing the claimed properties, we will have shown that the parse path
conservation function, and by extension Algorithm 4, correctly calculate the
conservation function in a finite amount of time and guarantee the successful
execution of the parsing algorithm.

Consider Definition 16, wherein we define the function N_{α,T}. In that section,
we were given a TLR grammar and a sentential form, and we chose a sentence
derivable from that sentential form. A previous Proposition justified our choice of
an arbitrary sentence derivable from the given sentential form: the portion of the
tree consisting of nodes that were descendants of nodes representing symbols in the
original sentential form makes no contribution to the value of the function N_{α,T}.
By inspection of the definition of the function N_{α,T}, we can also conclude that
nodes that are ordered greater than the node representing the symbol a make no
such contribution either. Let us investigate what would happen to N_{α,T} were we to
remove those nodes from a parse tree.

Definition 39. Let G = (Σ, N, P, S, T, M) be a TLR grammar and let α = βBax
be a sentential form for G such that β ∈ (Σ ∪ N)* and x ∈ Σ*, while B and a are a
nonterminal and a terminal, respectively. Let y be a sentence in G such that α ⇒* y,
and let T be the parse tree for y. Let A_1, A_2, …, A_n be the nodes representing the
symbols βB, and let B and A represent the B and a, as they appear in α, respectively.
We define two operators μ and μ′ which act on trees. The action of μ′ is to remove
all nodes X if either:

1. X is a descendant of some node Y, where Y = A_i for some 1 ≤ i ≤ n; or

2. X is not a descendant of any node Y, where Y = A_i for some 1 ≤ i ≤ n, and
in addition X > A.

Let μT be the tree formed from μ′T by replacing every leaf node labeled by a
production C → γ with a node labeled C. The tree μT we shall refer to as the
simplified tree for βBa. The operator μ is the simple-tree projection operator.

Definition 40. Let G = (Σ, N, P, S) be a context-free grammar. Let T be a tree
labeled with productions and grammar symbols from G, along with ε. Let N be an
interior node in T that is labeled with the production A → α. If the child-string of
N is a prefix of α, then we say that N is parse-proper.
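The prefix condition of Definition 40 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the encoding of a production as a pair `(lhs, rhs)`, and the reading of "child-string" as the sequence of symbols that the children stand for (the left side, for a child labeled by a production) are all hypothetical choices made here, not the paper's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # A label is a grammar symbol (str) or a production (lhs, rhs), rhs a tuple.
    label: object
    children: list = field(default_factory=list)


def symbol_of(node):
    """Symbol a node stands for: the left side if it is labeled by a production."""
    return node.label[0] if isinstance(node.label, tuple) else node.label


def child_string(node):
    """Symbols of the children, left to right (assumed reading of 'child-string')."""
    return tuple(symbol_of(c) for c in node.children)


def is_parse_proper(node):
    """True if every interior node labeled A -> alpha has a child-string
    that is a prefix of alpha."""
    if not node.children:
        return True
    if not isinstance(node.label, tuple):
        return False  # in this sketch an interior node must carry a production
    _, rhs = node.label
    cs = child_string(node)
    return cs == rhs[:len(cs)] and all(is_parse_proper(c) for c in node.children)
```

A node labeled A → cAD whose children are c and a node labeled A → B passes the check, since its child-string cA is a prefix of cAD.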

Proposition 3. Every simplified tree is parse-proper. 

Definition 41. Let T be a simplified tree for βBa. Let B and A be the nodes
corresponding to B and a, respectively. Let U be the set of nodes that are ancestral
to B. Let

Z = {C in T : C corresponds to one of the symbols in βB}
∪ {C in T : C is the least node autoancestral to A but not B}

For every C ∈ Z, the parent of C is in U. Let W ⊂ U be such that, for every node
D ∈ W, there exists some C ∈ Z such that D is the parent of C. Let us put the
elements of W into an ascending sequence we call the prefix-ancestral sequence:
this sequence is (M_0, M_1, …, M_m). For 0 ≤ i ≤ m, define

\[ V_i = \begin{cases} \{F \in U : F < M_1\} & i = 0 \\ \{F \in U : M_i < F < M_{i+1}\} & 1 \le i < m \end{cases} \]

Call (V_i)_{i=0}^{m} the prefix-ancestral interstitial sequence for B and A. If, for all
0 ≤ i ≤ m, there are no more than two distinct nodes A_1 and A_2 in V_i such that
ℒ(A_1) = ℒ(A_2), then we say that T is proper above B.

Definition 42. Let T be a simplified tree for βBa. Let B and A be the nodes
corresponding to B and a, respectively. Let X be the set of nodes that are ancestral
to A but not B. If there exist no more than two nodes D_1, D_2 ∈ X such that
ℒ(D_1) = ℒ(D_2), then we say that T is proper above A.

Definition 43. Let T be a simplified tree for βBa. Let B and A be the nodes
corresponding to B and a, respectively. If T is proper above both A and B, then we
say that T is a proper simplified tree for βBa.

Definition 44. Let G = (Σ, N, P, S, T, M) be a TLR grammar and let β be a viable
prefix, ending in a nonterminal, followed by the terminal a. Define a
set of trees which we will refer to as the simplified tree set for βa as follows:

Γ_{βa} = {T′ : T′ is a simplified tree for βa}.

We also define the following set of trees, which we will refer to as the proper
simplified tree set for βa:

𝒯_{βa} = {T′ : T′ is a proper simplified tree for βa}.

Definition 45. Let G = (Σ, N, P, S) be an LR(1) grammar. Let β be a viable prefix
followed by a. Let T ∈ Γ_{βa}. Let P ∈ T, and let the children of P be A_1, A_2, …, A_n.
Define

\[ H_T(P) = \begin{cases} n + 1 & P \text{ is an ancestor of } B \text{ but not } A \\ i & P \text{ is an ancestor of } A, \text{ and } A_i \text{ is autoancestral to } A \\ n + 1 & P \text{ shares an ancestor with both } A \text{ and } B, \text{ and } B < P < A \end{cases} \]

For π ∈ P and T ∈ Γ_{βa}, define

\[ M(T, \pi) = \max\{H_T(P) : \mathcal{L}(P) = \pi\}. \]

We define the simplified conservation function for βa:

\[ Y_{\beta a}(\pi) = \begin{cases} M(T, \pi) & \text{there exists some } T \in \Gamma_{\beta a} \text{ containing } P \text{ such that } \mathcal{L}(P) = \pi \\ -1 & \text{otherwise} \end{cases} \]

Finally, we define the proper simplified conservation function for βa:

\[ Z_{\beta a}(\pi) = \begin{cases} M(T, \pi) & \text{there exists some } T \in \mathcal{T}_{\beta a} \text{ containing } P \text{ such that } \mathcal{L}(P) = \pi \\ -1 & \text{otherwise} \end{cases} \]

Definition 46. Let T be a tree, and let A, B, and C be three nodes such that A is
ancestral to B, which is ancestral to C. A function which takes T, along with A, B,
and C as arguments, and whose range is {0, 1}, we refer to as a tree-projection
decision function.

Definition 47. Let T be a tree. We call μ a triplet location function if either
μ(T) = ∅ or

μ(T) = (A, B, C),

where A, B, and C are all nodes of T, such that A is ancestral to B, which is ancestral
to C.

Since we give the nodes of a tree a total order, we can create a bijection between
the nodes of a tree and the first n integers; thus, the return values of the triplet
location function and the last three arguments to a tree-projection decision function
are all elements of ℤ_{n−1}.

Definition 48. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable
prefix followed by a. Let T be a parse-proper tree with yield βBa, letting B and
A be the nodes corresponding to B and a, respectively. Let ρ be a tree-projection
decision function, and let μ be a triplet location function. We define an operator
Π_{ρ,μ}, which will transform T. If μ(T) = ∅, then Π_{ρ,μ} has no effect. Assume instead
that μ(T) = (A, B, C); let the parents of A, B, and C be P_A, P_B, and P_C, respectively,
should they all exist (in particular, P_A). The operation of Π_{ρ,μ} is as follows.

1. If ρ(T, A, B, C) = 0, then do one of the following:

(a) if A is the root of T, then make B the new root, but

(b) if A is not the root of T, then remove A as a child of P_A, and change the
parent of B to P_A.

2. Otherwise, if ρ(T, A, B, C) = 1, then remove B as a child of P_B, and
change the parent of C to P_B.

We call Π_{ρ,μ} the tree triplet surgery operator.
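The two cases of the tree triplet surgery operator can be sketched in Python as follows. The `Node` class, the function names, and the choice to put the promoted node in the same position among its new siblings are assumptions made for this illustration; the definition itself only says which node is removed and which node is reparented.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def replace_with_descendant(root, upper, lower):
    """Remove `upper` and put its descendant `lower` in its place.
    Returns the root of the resulting tree (which may be `lower`)."""
    lower.parent.children.remove(lower)      # detach lower from its old parent
    if upper.parent is None:                 # case 1(a): upper was the root
        lower.parent = None
        return lower
    siblings = upper.parent.children         # case 1(b) or case 2
    siblings[siblings.index(upper)] = lower  # assumed: keep upper's position
    lower.parent = upper.parent
    return root


def triplet_surgery(root, triple, decision):
    """Apply one surgery step; `triple` is None when no triple is located."""
    if triple is None:
        return root
    a, b, c = triple
    if decision(root, a, b, c) == 0:
        return replace_with_descendant(root, a, b)   # drop A, promote B
    return replace_with_descendant(root, b, c)       # drop B, promote C
```

In both cases one node of the triple disappears and the subtree below the promoted node is kept intact, which is why repeated application shortens chains of equally labeled ancestors.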

If X is a totally ordered set, then we use the following total order on
X × X × ⋯ × X = X^n. Let (a_1, a_2, …, a_n), (b_1, b_2, …, b_n) ∈ X^n; we say that
(a_1, a_2, …, a_n) ≤ (b_1, b_2, …, b_n) in either of the following cases:

1. a_i = b_i for 1 ≤ i < j ≤ n, and a_j < b_j, or

2. a_i = b_i for 1 ≤ i ≤ n.
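This is the usual lexicographic order. A minimal Python sketch follows; the function name `lex_leq` is introduced here for illustration only.

```python
def lex_leq(a, b):
    """Lexicographic comparison of two equal-length tuples:
    a <= b if they agree everywhere, or if a is smaller at the first
    position where they differ."""
    if len(a) != len(b):
        raise ValueError("tuples must have the same length")
    for x, y in zip(a, b):
        if x != y:
            return x < y      # case 1: first difference decides
    return True               # case 2: all components equal
```

For tuples of equal length this agrees with Python's built-in tuple comparison `a <= b`.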

Definition 49. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable
prefix followed by a. Let T be a parse-proper tree with yield βBa, letting B and
A be the nodes corresponding to B and a, respectively. Let W = (M_0, M_1, …, M_m)
be the prefix-ancestral sequence for B and A, and let (V_i)_{i=0}^{m} be the
prefix-ancestral interstitial sequence for B and A. We let 0 ≤ j ≤ m be the least
index such that there are nodes N_1, N_2, N_3 ∈ V_j such that

N_1 < N_2 < N_3, and
ℒ(N_1) = ℒ(N_2) = ℒ(N_3).

We say that the three nodes (N_1, N_2, N_3) are a repetitive triple for V_j. If no such
j exists, then let μ_B(T) = ∅. If such a j does exist, let (N_1, N_2, N_3) be a repetitive
triple for V_j, such that there does not exist a repetitive triple (M_1, M_2, M_3) satisfying

(M_1, M_2, M_3) < (N_1, N_2, N_3).

Call the repetitive triple (N_1, N_2, N_3) the active repetitive triple for T, and let

μ_B(T) = (N_1, N_2, N_3).

Definition 50. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable
prefix followed by a. Let T be a parse-proper tree with yield βBa, letting B and
A be the nodes corresponding to B and a, respectively. Let X be the set of nodes
ancestral to A, but not ancestral to B. Let C_1, C_2, C_3 ∈ X be three nodes such that

C_1 < C_2 < C_3, and

ℒ(C_1) = ℒ(C_2) = ℒ(C_3).

Call (C_1, C_2, C_3) a lookahead repetitive triple (LA-repetitive triple) for T. If
(C_1, C_2, C_3) is an LA-repetitive triple for T, such that there does not exist a
lookahead repetitive triple (D_1, D_2, D_3) satisfying

(D_1, D_2, D_3) < (C_1, C_2, C_3),

then we refer to (C_1, C_2, C_3) as the active LA-repetitive triple for T. If there is
no active LA-repetitive triple in T, then μ_A(T) = ∅; if (C_1, C_2, C_3) is the active
LA-repetitive triple, then let μ_A(T) = (C_1, C_2, C_3).
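Locating the active triple amounts to taking the lexicographically least triple of equally labeled nodes. The sketch below assumes, purely for illustration, that each node of the set under consideration is given as a pair `(order_index, label)`, where `order_index` is the node's position in the total order on nodes; the function name is likewise hypothetical.

```python
from itertools import combinations


def active_repetitive_triple(nodes):
    """Return the least triple (i1, i2, i3) of order indices, i1 < i2 < i3,
    whose nodes all carry the same label, or None if there is no such triple."""
    ordered = sorted(nodes)                       # ascending by order index
    triples = [
        tuple(index for index, _ in triple)
        for triple in combinations(ordered, 3)    # each triple is ascending
        if triple[0][1] == triple[1][1] == triple[2][1]
    ]
    return min(triples) if triples else None      # tuple order is lexicographic
```

Returning `None` plays the role of the value ∅ of the triplet location function when no repetitive triple exists.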

We will use one of only two constructions for tree-projection decision functions
in the present work. Let ρ_0 be such that ρ_0(T, A, B, C) = 0 always. Our other
construction for a tree-projection decision function is more complex. Let G =
(Σ, N, P, S) be an LR(1) grammar, and let βB be a viable prefix followed by a. Let
T be a parse-proper tree with yield βBa, letting B and A be the nodes corresponding
to B and a, respectively. Let π ∈ P, where π = A → α, and let 1 ≤ n ≤ |α| + 1. Let
C_1, C_2, and C_3 be three nodes such that C_1 is ancestral to C_2, which is ancestral
to C_3. We define ρ_{π,n}(T, C_1, C_2, C_3) presently: if there is a node N ancestral to C_2,
such that

• C_1 < N < C_2,

• ℒ(N) = π, and

• H_T(N) = n,

then let ρ_{π,n}(T, C_1, C_2, C_3) = 0; otherwise, let ρ_{π,n}(T, C_1, C_2, C_3) = 1.

Definition 51. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable
prefix followed by a. Let T be a parse-proper tree with yield βBa, letting B and
A be the nodes corresponding to B and a, respectively. Let ρ be a tree-projection
decision function. We now define two special tree triplet surgery operators; let
Φ_ρ = Π_{ρ,μ_B} and let Λ_ρ = Π_{ρ,μ_A}. Let p and q be such that Φ_ρ^{p+1}T = Φ_ρ^{p}T and
Λ_ρ^{q+1}T = Λ_ρ^{q}T, respectively; refer to p and q as the Φ-limit and Λ-limit for T of
Φ_ρ and Λ_ρ, respectively. Define Ψ_ρ = Λ_ρ^{q}Φ_ρ^{p}, an operator we refer to as the proper
projection operator.

We will establish the following conventions. Let G = (Σ, N, P, S) be an LR(1)
grammar, and let βB be a viable prefix followed by a. If T is a parse-proper tree
and P is a node in T, then we will use the symbol Ψ_{T,P} to mean the operator Ψ_ρ,
with ρ = ρ_{π,n}, where π = ℒ(P) and n = H_T(P). We use the symbol Ψ_0 to mean the
operator Ψ_{ρ_0}.

Lemma 9. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable prefix
followed by a. If T is a parse-proper tree with yield βBa, then Φ_ρ^{p}T is proper above
B, where p is the Φ-limit for T.

Proof. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable prefix,
followed by a, for B ∈ N. Let T be a parse-proper tree with yield βBa. Let ρ be a
tree-projection decision function, and let p be the Φ-limit of Φ_ρ for T. Let (V_i)_{i=0}^{m}
be the prefix-ancestral interstitial sequence for B and A.

We proceed by induction on p. If p = 0, then there are no repetitive triples
for T. There are thus no more than 2 distinct nodes B_1 and B_2 such that ℒ(B_1) =
ℒ(B_2). Therefore, T is proper above B. Since Φ_ρ^{p}T = T, we conclude that Φ_ρ^{p}T is
proper above B.

Assume that Φ_ρ^{p_Z}Z is proper above B for every tree Z with Φ-limit of p_Z, for
some p_Z ≥ 0. Assume also that p = p_Z + 1. Let (B_1, B_2, B_3) be the active repetitive
triple for T. We replace one of these three nodes with one of the remaining two;
since ℒ(B_1) = ℒ(B_2) = ℒ(B_3), we conclude that Φ_ρT is parse-proper. As the
Φ-limit of Φ_ρ for Φ_ρT is p − 1, the induction hypothesis implies that Φ_ρ^{p−1}Φ_ρT is
proper above B. ■

Lemma 10. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable
prefix followed by a. If T is a parse-proper tree with yield βBa, then Λ_ρ^{q}T is proper
above A, where q is the Λ-limit.

Proof. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable prefix,
followed by a, for B ∈ N. Let T be a parse-proper tree with yield βBa. Let ρ be a
tree-projection decision function, and let q be the Λ-limit of Λ_ρ for T. Let X be the
set of all nodes ancestral to A but not B.

We proceed by induction on q. If q = 0, then there are no LA-repetitive triples
for T. Thus, there are at most two nodes D and D′ in X such that ℒ(D) = ℒ(D′).
Therefore, T is proper above A. Since Λ_ρ^{q}T = T, we conclude that Λ_ρ^{q}T is proper
above A.

Assume that, for any tree Y with a Λ-limit of Λ_ρ that is q_Y ≥ 0, we know
that Λ_ρ^{q_Y}Y is proper above A. Assume also that q = q_Y + 1. Let (C_1, C_2, C_3) be
the active LA-repetitive triple for T. We know that ℒ(C_1) = ℒ(C_2) = ℒ(C_3),
and since whichever node is replaced gets replaced by one with the same label, we
have a parse-proper tree in Λ_ρT. The Λ-limit of Λ_ρ for Λ_ρT is q − 1. We therefore
have, by the induction hypothesis, that Λ_ρ^{q−1}Λ_ρT is proper above A. ■

Let us consider an example. Let G = (Σ, N, P, S) be the LR(1) grammar with

Σ = {q, r, h, j, c, b, k},

N = {S, H, G, A, B, D, Q}, and

P = {S → H,

H → QGr | Ak,

G → Gh | jH,

A → cAD | B,

B → b,

D → ε,

Q → q}.

We now consider the string x = qjqjqjccccbkrrhhhr. It is easily verified that
x ∈ L(G). Let T be the parse tree for x. With the order that we have given trees
in this work, we can represent T graphically as in Figure 2. We now consider
a simplified tree and a proper simplified tree for the viable prefix qjqjqjccccB,
when followed by k; we have presented these trees in Figure 3.
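The membership claim x ∈ L(G) can be checked mechanically. The sketch below is a generic fixpoint recognizer for context-free grammars, not the paper's LR(1) or TLR parser. It assumes the production set is S → H, H → QGr | Ak, G → Gh | jH, A → cAD | B, B → b, D → ε, Q → q; the productions for B, D, and Q and the form G → jH are readings inferred from the labels surviving in Figure 2 and from the string x, so they should be treated as assumptions.

```python
from collections import defaultdict

# Assumed reading of the example grammar; () encodes an epsilon production.
GRAMMAR = {
    "S": [("H",)],
    "H": [("Q", "G", "r"), ("A", "k")],
    "G": [("G", "h"), ("j", "H")],
    "A": [("c", "A", "D"), ("B",)],
    "B": [("b",)],
    "D": [()],
    "Q": [("q",)],
}


def derives(grammar, start, text):
    """True if `start` derives `text`; computed as a least fixpoint over spans,
    so left recursion and epsilon productions need no special handling."""
    n = len(text)
    spans = defaultdict(set)   # (nonterminal, i) -> {j : it derives text[i:j]}

    def ends_of(rhs, i):
        """End positions reachable by matching `rhs` from position i."""
        ends = {i}
        for sym in rhs:
            nxt = set()
            for e in ends:
                if sym in grammar:
                    nxt |= spans[(sym, e)]
                elif e < n and text[e] == sym:
                    nxt.add(e + 1)
            ends = nxt
        return ends

    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i in range(n + 1):
                    new = ends_of(rhs, i) - spans[(lhs, i)]
                    if new:
                        spans[(lhs, i)] |= new
                        changed = True
    return n in spans[(start, 0)]
```

Under these assumptions, `derives(GRAMMAR, "S", "qjqjqjccccbkrrhhhr")` holds: the three trailing h come from G → Gh applied at the outermost G, and each nested H → QGr contributes one q and one r.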

We care about proper simplified trees because Z_{βa} reproduces the conservation
function, yet is finite.

Theorem 6. Let G = (Σ, N, P, S, T, M). The proper simplified tree set is finite.

Proof. Let G = (Σ, N, P, S) be an LR(1) grammar, and let βB be a viable prefix
followed by a, for B ∈ N. It will suffice to show that there is an upper bound to the
height for elements of 𝒯_{βBa}.

Let T ∈ 𝒯_{βBa}. Let B and A be nodes in T corresponding to B and a, respectively.
Let (V_i)_{i=0}^{m} be the prefix-ancestral interstitial sequence for B and A.

Let D ∈ N such that

D ⇒* ε, (21)

where, for no D′ ∈ N with D′ ≠ D is it the case that D′ ⇒* ε is a longer derivation
than (21). Let N_0 be the length of (21).

Let X be the greatest node ancestral to both B and A. Let B_A be the child of X
autoancestral to B and let Q be those children of X greater than B_A. The greatest
child of X is in Q; let this child be A_Q.

Every element in Q \ {A_Q} has a height not greater than N_0. Let Y_A be the
subtree rooted at A_Q. Let X_A be the set of nodes in Y_A ancestral to A. By the
condition that T is proper above A, we have that |X_A| ≤ 2|P|. It is possible that the
height of Y_A could exceed |X_A| + 1, but not by N_0. That is, the height of Y_A is less
than |X_A| + N_0.

Let Z_B be the subtree with B_A as its root. Let y be the yield of Z_B. Let N_y ⊂ N
such that for all C ∈ N_y, we have that C ⇒* y. Letting N_y = {C_1, C_2, …, C_n}, let
h_i be the height of the parse tree corresponding to the derivation C_i ⇒* y. Finally,
let M_B = max{h_1, h_2, …, h_n}; clearly, the height of Z_B is less than M_B.

Let δ ∈ (Σ ∪ N)* such that δy = βB. Within each V_i, for 0 ≤ i ≤ |δ|, there are
not more than 2|P| nodes, by the condition that T is proper above B. Therefore,
if we let T_δ be the tree formed by removing from T all those nodes in Q, we can
conclude that the height of T_δ is not more than 2|P||δ|.

Combining these results, we see that the height of T is not greater than

2|P||δ| + max{M_B, |X_A| + N_0}. ■

Figure 2: The parse tree for the string qjqjqjccccbkrrhhhr.

Figure 3: The simplified, and the proper simplified, trees for the viable prefix
qjqjqjccccB, in the former case, followed by the suffix krrhhhr, and in the latter
case, with lookahead k.

Definition 52. Let G = (Σ, N, P, S, T, M). Let P_0 be a production set over the
terminal alphabet Σ and the nonterminal alphabet N′ ⊃ N. If π = A → X_1X_2…X_m
is a production in P such that Y_{βa}(π) ≠ −1, and either

• π ∈ P_0 if Y_{βa}(π) = m + 1; or

• there is some φ ∈ P_0 such that φ = A → X_1X_2…X_jS_1S_2…S_l, where j =
Y_{βa}(π) if Y_{βa}(π) < m + 1;

then we say that P_0 simply conserves βa for P. If we replace Y_{βa} with Z_{βa}, then
we say that P_0 simply conserves βa for P properly.

Lemma 11. Let the LR(1) grammar G = (Σ, N, P, S), the viable prefix β ending in
a nonterminal, and the terminal a, which follows β, all be given. Let T ∈ Γ_{βa}. For
any node P in T, such that H_T(P) ≠ −1, there exists a tree T′ ∈ 𝒯_{βa} such that, for
some node P′ in T′, we have that

H_T(P) = H_{T′}(P′).

Proof. Let β = αB, where B is a nonterminal. Let T ∈ Γ_{βa}, and let P in T, such
that H_T(P) ≠ −1. Let B and A correspond to B and a, respectively.

Let U be the set of nodes ancestral to B. Let W = (M_0, M_1, …, M_m) be the
prefix-ancestral sequence for B and A, and let (V_i)_{i=0}^{m} be the prefix-ancestral
interstitial sequence for B and A.

Let Ω = Ψ_{T,P}. It can easily be verified by inspecting the definition of Ψ_{T,P} that
there is a node P′ in ΩT, corresponding to P, such that H_{ΩT}(P′) = H_T(P).

The simplified tree ΩT is proper above A and it is proper above B (Lemmas 9
and 10). Since there is some node P′ in ΩT such that H_{ΩT}(P′) = H_T(P), we have
our conclusion. ■

Corollary 1. Let the TLR grammar G = (Σ, N, P, S, T, M), the viable prefix β, the
terminal a, which follows β, and a production set P_0 all be given. Then P_0 simply
conserves βa for P if and only if it does so properly.

Theorem 7. Let the TLR grammar G = (Σ, N, P, S, T, M), the viable prefix β, the
terminal a, which follows β, and the grammar transformation Δ be given.
Let ΔG = (Σ, N_{ΔG}, P_{ΔG}, S, T, M). Then Δ ∈ 𝒱_G(βa) if and only if P_{ΔG} simply
conserves βa for P.

Proof. Let Δ ∈ 𝒱_G(βa). Let x ∈ Σ* be such that βax is a sentential form, and let
y ∈ Σ* be such that βax ⇒* y. Let T be the parse tree for y, and let B and A be the
nodes in T corresponding to the symbols B and a, respectively.

Let π be a free production; that is: V_{βa}(π) = −1. Thus, there are no nodes P in
T meeting any of the first three criteria from Definition 16. Therefore, there are no
simplified trees with a node P such that ℒ(P) = π.

Let π = A → α be a conserved production that is not entirely conserved; thus,
V_{βa}(π) = n, for −1 ≠ n ≤ |α|. There is some node P ancestral to A such that
ℒ(P) = π, where the nth child of P is autoancestral to A; this node does not get
removed from T by μ (Definition 39). Therefore, in the tree μT, there is a node P_0
such that H_{μT}(P_0) = n. We consider the possibility that Y_{βa}(π) > n.

Assume that Y_{βa}(π) = n_0 > n. That is, assume that there is some T_1 ∈ Γ_{βa}
(letting B_1 and A_1 be the nodes corresponding to the symbols B and a, respectively)
such that there is a node F in T_1 whose n_0th child is autoancestral to A_1. As β
is a viable prefix followed by a, there exists some x_1 ∈ Σ* such that βax_1 is a
sentential form. Let z ∈ Σ* such that βax_1 ⇒* z; let T_z be the parse tree for z. There
is an obvious injective mapping of nodes g : T_1 → T_z that preserves the structure
of T_1. The node g(F) in T_z is such that N_{βax_1,T_z}(g(F)) = n_0, a contradiction. Thus,
Y_{βa}(π) = n.

Let φ = C → γ be entirely conserved; since φ ∈ P_{ΔG}, we have that P_{ΔG} simply
conserves βa for P. Therefore, the "If" direction is proved.

Let P_{ΔG} simply conserve βa for P. Let π = A → α be such that

Y_{βa}(π) ≤ |α|.

Is it possible that V_{βa}(π) > Y_{βa}(π)? Assume one such π exists. Let y ∈ Σ* such that
βax ⇒* y, and let T be the parse tree for y. We have assumed that there exists some
node P in T, labeled π, whose value under N exceeds Y_{βa}(π).
But, since that value is greater than −1, it must be the case that either:

1. P is an ancestor of B;

2. P is an ancestor of A; or

3. P shares an ancestor with A and B, but is an ancestor of neither, such that
B < P < A.

In any of these three cases, we would have that P would not be removed from T by
μ. Hence, there is some T_0 ∈ Γ_{βa} such that there exists P_0 in T_0 such that

H_{T_0}(P_0) > −1

and ℒ(P_0) = π, a contradiction.

Let π be such that Y_{βa}(π) = n + 1. Since π ∈ P_{ΔG}, we have that P_{ΔG} conserves
βa for P. ■

4.3 The Connection Between Parse Paths and Proper Simplified Trees

We are now ready to justify the method of Algorithm 3 as a means of determining
if a grammar transformation is valid. We have modeled the operation of the
Algorithm as an enumeration of parse paths, and we have examined a new formulation
for determining if a transformation is valid. We now show that the method of
Section 4.2 is just another way of looking at the operation of the Algorithm, in that
each proper simplified tree corresponds to a parse path, and vice versa.

4.3.1 A Mapping of Parse Paths to Proper Simplified Trees 

Let G be a TLR grammar. We let 𝒯_G be the set of all trees whose nodes are labeled
either with a terminal, nonterminal or production from G. Let the set of all parse
paths for the viable prefix βB, followed by the terminal a, be 𝒫_{βB,a,G}. Additionally,
let

\[ \mathcal{S}_{\beta B,a,G} = \bigcup_{(U,J,s,D) \in \mathcal{P}_{\beta B,a,G}} s, \quad\text{and}\quad \mathcal{D}_{\beta B,a,G} = \bigcup_{(U,J,s,D) \in \mathcal{P}_{\beta B,a,G}} D. \]

If G, β and a are understood, we omit them. If we are in a context where G, β and
a are understood, then we use the symbols U_0 and U_1 as synonyms for 𝒥 and 𝒰,
respectively.

Definition 53. Let T, a nonempty tree in 𝒯, be given. Let P be the greatest leaf in
T; if P is labeled with a grammar symbol, then let P_a be the parent of P; if P is
labeled with a production, then let P_a = P. Define P_a to be the attach point for
T.

Definition 54. Let the TLR grammar G = (Σ, N, P, S, T, M), and the viable prefix
β, followed by the terminal a, be given. Let k ∈ {0, 1}, and let x ∈ Σ* such
that |x| = k. Define Q_k : U_k → 𝒯 as follows: the value of Q_k(U), letting U =
(u_1, u_2, …, u_n) and letting u_n = [A → α · γ, x], is given by one of the following
cases.

1. Assume that |U| = 1. In this case, we must have, by Rule 5 of Definition 29,
that α = ε. Let Q_k(U) = T_1, where T_1 is the one-node tree whose node is
labeled A → γ.

2. Assume that |U| > 1, and that α ≠ ε. Let α = α′X for some X ∈ (Σ ∪ N).
In this case, we let T = Q_k(u_1, u_2, …, u_{n−1}), and we let P be the attach point
for T. We let T′ be that tree formed from T by adding a child labeled X
to the children of P, such that this new node is the greatest child of P in T′; let
Q_k(U) = T′.

3. Assume that |U| > 1, and that α = ε. In this case, we let V = Q_k(u_1, u_2, …, u_{n−1}),
and we let Q be the attach point for V. We let V′ be that tree formed from
V by adding a child labeled A → γ to the children of Q, such that this new
node is the greatest child of Q in V′; let Q_k(U) = V′.

Call Q_k the k-consumption function.
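The three cases of the consumption function translate into a short loop. In the Python sketch below an item [A → α · γ, x] is encoded as a triple `(lhs, alpha, gamma)` with the lookahead dropped, a production label is a pair `(lhs, rhs)`, and the "greatest leaf" is taken to be the last leaf in left-to-right order. These encodings and all names are assumptions made for the illustration.

```python
class Node:
    def __init__(self, label):
        self.label = label        # grammar symbol (str) or production (lhs, rhs)
        self.parent = None
        self.children = []

    def add_child(self, label):
        child = Node(label)
        child.parent = self
        self.children.append(child)   # appended last: the greatest child
        return child


def greatest_leaf(node):
    while node.children:
        node = node.children[-1]
    return node


def attach_point(root):
    """The greatest leaf itself if it is labeled by a production,
    otherwise its parent."""
    leaf = greatest_leaf(root)
    return leaf if isinstance(leaf.label, tuple) else leaf.parent


def consume(items):
    """Build the tree for a nonempty sequence of items (lhs, alpha, gamma)."""
    lhs, alpha, gamma = items[0]
    root = Node((lhs, tuple(gamma)))        # case 1: alpha is assumed empty
    for lhs, alpha, gamma in items[1:]:
        point = attach_point(root)
        if alpha:                           # case 2: alpha = alpha' X, add X
            point.add_child(alpha[-1])
        else:                               # case 3: add a production node
            point.add_child((lhs, tuple(gamma)))
    return root
```

The loop is the iterative form of the recursive definition: the tree for the first n − 1 items is built first, and the nth item adds exactly one node at the attach point.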

Definition 55. Call the 1-consumption function the upward link conversion
function; denote this function Q_U. Call the 0-consumption function the ancestral link
conversion function; denote this function Q_A.

Definition 56. Let the TLR grammar G, and the viable prefix β, which is followed
by a, be given. Let 𝒯* be the set of sequences of elements from 𝒯. We will define
Q_D : 𝒟 → 𝒯*. Let D ∈ 𝒟 be (L_1, L_2, …, L_n). The value of Q_D(D) is given by one
of 2 cases.

1. Assume that n = 1. In this case, let L_1 = (X_1X_2…X_p, k). For 1 ≤ i < k, let
V_i be the parse tree for the derivation X_i ⇒* ε; let V_k be the single-node tree
whose node is labeled X_k.

2. Assume that n > 1. In this case, let L_1 = (X_1X_2…X_p, k). For 1 ≤ i < k,
let V_i be the parse tree for the derivation X_i ⇒* ε. Let Q_D(L_2, L_3, …, L_n) =
(Y_1, Y_2, …, Y_h). Let V_k be the tree whose root:

• is labeled P(L_1, L_2), and

• has Y_1, Y_2, …, Y_h as its children.

Define Q_D(D) = (V_1, V_2, …, V_k). Call Q_D the downward link conversion
function.

Definition 57. If G is a context-free grammar, and T is a parse-proper tree such
that, for every interior node N, the child-string of N is the entire right side of the
production labeling N, then we say that T is parse-complete.

Proposition 4. A parse tree is parse-complete.

Definition 58. If G = (Σ, N, P, S) is an LR(1) grammar, and T is a parse-proper
tree, then we define a function F_U : 𝒯 → 𝒯, where F_U(T) is given by the following:
for every interior node N whose child-string is a proper prefix of the right side of
the production labeling N, we let A_1, A_2, …, A_n ∈ N be such that the right side is
the child-string of N followed by A_1A_2…A_n, and we make the trees V_1, V_2, …, V_n be new
children of N (in order), where for 1 ≤ i ≤ n, the tree V_i is the parse tree for the
derivation A_i ⇒* ε.

Proposition 5. If U is an upward link, then F_U(Q_U(U)) is parse-complete.

Definition 59. Let (U, J, s, D) be a parse path for the viable prefix βB followed
by a. We will define R : 𝒫 → 𝒯. The value of R(U, J, s, D) is as follows. Let
T_U = F_U(Q_U(U)), let Y = Q_D(D), and let T_J = Q_A(J). There are then two cases to
consider.

1. If we have that |U| = 0, then let T_0 be the tree formed by:

(a) attaching a node labeled B to the tree as the greatest child of the attach
point for T_J; and then

(b) attaching each tree from Y, in order, following the node added in 1a, to
the attach point for T_J.

2. If we have that |U| > 0, then let T_0 be the tree formed by:

(a) attaching the root node of T_U to the tree as the greatest child of the
attach point for T_J; and then

(b) attaching the root node of each tree from Y, in order, following the node
added in 2a, to the attach point for T_J.

Define R(U, J, s, D) = T_0; we call R the parse-path conversion function.

Lemma 12. If( U,J, s, D) is a parse-path, then R( U, J, s, D) is parse-proper 

Proof. Let k e {0, 1), and let x e 1,' such that |x| = k. Let e Ij., where 
Pk = (Pi , Pi, ■ ■ ■ ,Pn). We proceed by induction on n. Since there are no interior 
nodes in QuiPk) when \Pi^\ < 2, we can use = 2 as our basis step. The first two 
items of P^ are of the form 

[A; — > ■A2cri,x], and 
[Aj -> ■a2,x], 

respectively. Let Tq = QuiPk)- Since the only child of the root, which is the only 
interior node, is labeled A2 — » ci'2. the child-string of the root is A2, which is a 
prefix of Alai . 

Let f J € \ be a parse-precession, such that IP^J > 2. Assume that we know 
that QkiPk) is a parse-proper tree; we now assume that IPjI = \P'i^\ -1- 1. Let Tp be 
the tree obtained after IP^I applications of Qk. There are two possibilities for p„. 

1. Assume that p„ is of the form [A — » aX • 7, x]. In this case, the attach point 
Pa will be labeled A — » aXy. The child-string of Pa will be a. The final 
application of Q\ will attach a node labeled X as the greatest child of Pa, 
forming the tree Tq. The parent of this new node — the counterpart of Pa in 
To — has the child-string aX, which is a prefix of aXy. 
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2. Assume that p„ is of the form [C — » -S, x]. In this case, the attach point of 
Tp will be labeled [Da — * (a • Ct]^,x]. The child-string of Pa is ^a- After 
the final application of Q^^, the child-string will be ^aC, which is a prefix of 
(aCtia, as required. 

Thus, we have established that Q\]{U) and QaU) are parse-proper trees. 

We turn now to the downward link. Let D = (Li,L2, . . . ,Lp). Assume that 
p = 1. In this case, the value of Qj) is a sequence of trees Y = (Ti,T2, . . . ,T,„). 
The trees Tj , T2, . . . , T„,_i are all parse trees, so they are all parse-proper. In this 
case, the tree T^ will be a single-node tree, hence, it is trivially parse-proper. 

Assume now that all elements of Qn{D') are parse-proper if \D'\ > 1. Assume 
also that p = \D'\ + I. After p - I applications of Qo, we have a downward link 
{9, kg) and a sequence Y = (V|, V2, . . . , V,). By the induction hypothesis, each 
element of Y is a parse-proper tree. The penultimate application of (2d will result 
in a sequence of ko trees: the first kg - I trees are parse-proper, as they are parse 
trees. We know that P(L[,L2) is a production, where 

P{Lu L2) = A ^(,^(V;))^(,^(V2)) . . . ^mV.^WiXi . ..X,„. 

The next application of (2d will result in a sequence of trees, all but the last of 
which are clearly parse trees; let the last tree in this sequence be V. Since the 
elements of Y are added in order, we have, letting the children of the root of V be 
labeled C i , C2 , . . . , C^t^ , we can see that 

^(C,) = ^(^(Vi)), 

if (C2) = i^(^(V2)), 

etc. 

Since this is a prefix of the right side of P(Li, L2), we have shown that every ele- 
ment of Q^{D) is parse-proper. 

There are two ways that three trees QaU) and QuiU) are combined with the 
trees in Qd(D) to form the tree R{U, J, s, D). Let i = [Ao- aa-Xcr ■ fa-]- 

1. Assume that |t/| = 0. In this case, we know that X^r = B. By Rule Hal of 
Definition l59l we add a node labeled B to the attach point. 

2. Assume instead that \U\ > 0. In this case, we know that «i = [Xa- — > ■K],a]. 
Thus, M{Qv(U)) = X^. By Rule|5a|of Definition|59| we add a tree— namely, 
the tree Q\](U) — whose root is labeled X„. kj. 

The attach point Po_a of QaU) is labeled A^- — > a^Xo-yo-; since QaU) is parse- 
proper, the child-string of the attach point must be a^. So after the application of 
either Rule^^or Rulel2al whichever is appropriate, we have shown that the child 
string Po_a now has the child-string a„.X„-. 

Now, Li is a link, either partial-downward or terminal, for y^-. We can write 

Jcr = ^1^2 • • • ^k„^k„+l ■ ■ ■ ^iv 

After the application of whichever rule is appropriate, we add the elements of
$Q_D(D)$ to the attach point. Clearly,



X2 = .nmm. 



but what of and Vj,^? If Lj is a terminal link, then V^.^^ is a one-node tree whose 
node is labeled X^g = a. Otherwise, the root of V^^ is labeled P(Li, L2), which is 
A.ln either case, if(.^(V^„)) = X/.^. m 

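Definition 60. Let the TLR grammar $G$ and the parse-proper tree $T$ be given.
Let $L = (K_1, K_2, \ldots, K_n)$ be the sequence of leaves of $T$, such that
$K_1 < K_2 < \cdots < K_n$. If the label of $K_i$ is in $\Sigma \cup N \cup \{\epsilon\}$ for all
$1 \le i \le n$, then we let $\alpha$ be the concatenation of the labels of
$K_1, K_2, \ldots, K_n$, in that order.

We call $\alpha$ the yield of $T$.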
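
As an illustration only, the yield of Definition 60 can be sketched in Python. The `Node` class, the `EPSILON` marker, and the convention that the leaves are visited left to right in the order $K_1 < K_2 < \cdots < K_n$ are assumptions made for this sketch and are not part of the definition. The function returns `None` when some leaf is labeled by something other than a grammar symbol or $\epsilon$ (for example, a production node with no children yet), in which case the yield is not defined.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

EPSILON = "ε"  # assumed marker for a leaf labeled with the empty string


@dataclass
class Node:
    """A tree node; `label` is a grammar symbol, EPSILON, or a production."""
    label: str
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """Return the leaves below `node`, visited left to right."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def tree_yield(root: Node, symbols: Set[str]) -> Optional[List[str]]:
    """Concatenate the leaf labels, or return None if the yield is undefined."""
    ks = leaves(root)
    if any(k.label != EPSILON and k.label not in symbols for k in ks):
        return None  # some leaf is neither a grammar symbol nor epsilon
    # An epsilon leaf contributes the empty string to the concatenation.
    return [k.label for k in ks if k.label != EPSILON]
```

For a tree whose root is labeled `S -> A B`, with a child `A -> a` above the leaf `a` and a child `B -> ε` above an `EPSILON` leaf, the sketch returns the one-symbol string `a`.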

Lemma 13. The yield of $R(U, J, s, D)$ is $\beta B a$.

Proof. Let $Q_D(D) = (V_1, V_2, \ldots, V_{n_D})$. The yield of any of the trees
$V_1, V_2, \ldots, V_{n_D - 1}$ is clearly $\epsilon$. The yield of $V_{n_D}$ is clearly $a$.

Let $k \in \{0, 1\}$, and let $x \in \Sigma^*$ such that $|x| = k$.

Let $P_k = (p_1, p_2, \ldots, p_n)$ be a $k$-parse precession. We note that the only way a
grammar symbol leaf is added to the tree $T = Q_k(P_k)$ is when we evaluate $Q_k$ on
an item of the form

$[A \to \alpha X \cdot \beta, x] \qquad (22)$

If $p_{i_1}, p_{i_2}, \ldots, p_{i_m}$ is the set of all items of the form (22), such that
$i_1 < i_2 < \cdots < i_m$, where

$p_{i_h} = [A_h \to \alpha_h X_h \cdot \beta_h, x]$,

for $1 \le h \le m$, then the yield of $Q_k(P_k)$ is evidently the string $X_1 X_2 \cdots X_m$. But
this is just the trace of $P_k$. Therefore, the concatenation of the yield of $Q_A(J)$ and
$Q_U(U)$ is $\beta B$, by Theorem 5.

The yield of $R(U, J, s, D)$ is clearly the yield of $Q_A(J)$, $Q_U(U)$, and $V_{n_D}$,
concatenated, but this is just $\beta B a$. ■

Lemma 14. If $(U, J, s, D)$ is a parse path, then $R(U, J, s, D)$ is a simplified tree.

Proof. We wish to find a sentence y e L(G) such that, letting T be the parse tree 
for y, we have that TqT = R(U, J, s, D). 

Let c/K be the set of all leaves of TqT that are labeled with a nonterminal. For 
any nonterminal A, let Fa be the parse tree for the derivation 

for some za € E*. Let Ti be that tree formed from T by replacing every node 
N 6 t/K with the tree F ^(N). 

Let ^ be the set of all production nodes Jo in Ti such that the child-string 
of "^(Jo) 't =5^(Jo)- For any such node Jo, let ^(Jo) = aoZiZ2...Z„„ where 
tto = "^(Jo); let Zj„ be the sequence of trees (Fz, , Fz, , . . . , Fz,„). Let T2 be the tree 
formed by, for each Jf e D, adding the elements of Zj^ as children of Jf , in order. 

Let z be the yield of T2. Clearly, the parse tree of z is T2. Recalling the simple- 
tree projection operator yu, consider the tree ^Ji- We will examine the changes to 
J 2 made by /j. 

For any node such that corresponds to one of the symbols in p, we note 
that Nz would be removed and replaced with a node labeled ^(N^) by /j. We would 
also remove any node that is not an ancestor of any node representing one of 
the symbols in fia, provided that > A, where A is the node representing a. 

We consider the conditions under which we will add nodes to Ti in the con- 
struction of J 2- Let Hx be a node in Tj, such that we add children to it when 
constructing J 2- There are two cases: 

1. Assume that is some node in Qa(J)- In this case, we note that in Qa(J) — 
letting Hj be the node corresponding to in QkU) — the greatest child of 
Hj is an ancestor of the greatest tree in Qd(D). Let this child be the child 
of Hj in. Clearly, the simplified tree must retain the first / children. 

2. Assume that corresponds to some node in one of the trees in QoiD). We 
must have that corresponds to a node in the final element of Qy>{D). Let 
this node be Hd. Let Ad be the node in the final element of Qu(D) that is 
labeled a. We can easily show the following: "if ^(Hd) "^(Hd), then 
the greatest child of Hd is autoancestral to Ad." Thus, any nodes added 
as children of Hd during the construction of T2 will be removed when /j is 
applied to T2. 

Can any node be removed from the part of T2 corresponding to Q\j{U)l By 
Proposition|5| we have that there are no production nodes in -Fu(2u(t/)) that would 
have children removed. 

By Lemma 13, we can say that the only nodes which will be replaced during
the construction of $T_1$ from $T$ are the nodes corresponding to symbols in $\beta$ which
are labeled with nonterminals. Let $N_\beta$ be one of the nonterminal nodes that is
replaced by a production node. The head of the production will be the label of
$N_\beta$, and when we apply $\mu$ to $T_2$, we will replace this replacement node with a node
carrying that same label: that is, we will revert to $N_\beta$.

So, we have examined all nodes and changes made in applying $\mu$ to $T_2$, and we
have arrived back at $T$. Therefore, $R(U, J, s, D)$ is the simplified tree for $\beta a$. ■

Theorem 8. If $(U, J, s, D)$ is a parse path, then $R(U, J, s, D)$ is a proper simplified
tree.

Proof. Let $G = (\Sigma, N, P, S)$ be an LR(1) grammar, and let $\beta B$ be a viable prefix,
followed by $a$, where $B \in N$. Let $(U, J, s, D)$ be a parse path for $\beta B$ followed by $a$,
and let $T = R(U, J, s, D)$. Let the nodes B and A in $T$ correspond to the symbols $B$
and $a$, respectively.

There are two ways that T may fail to be proper: it may be improper above 
either A or B. 

Assume the former, for the sake of a contradiction. Let $D = (L_1, L_2, \ldots, L_n)$.
This means that there are three distinct nodes $P_{A,1}$, $P_{A,2}$, and $P_{A,3}$ in $T$, each of
which is ancestral to A but not B, such that all three carry the same label.
Assume, without loss of generality, that $P_{A,1} < P_{A,2} < P_{A,3}$. There are thus three
partial downward links $L_{i_1}$, $L_{i_2}$, and $L_{i_3}$, where $i_1 < i_2 < i_3$, corresponding to
$P_{A,1}$, $P_{A,2}$, and $P_{A,3}$, respectively. Now, the labels of $P_{A,1}$, $P_{A,2}$, and $P_{A,3}$ are,
respectively,

$P(L_{i_1}, L_{i_1+1})$, $P(L_{i_2}, L_{i_2+1})$, and $P(L_{i_3}, L_{i_3+1})$.

However, the fact that $P(L_{i_1}, L_{i_1+1}) = P(L_{i_2}, L_{i_2+1}) = P(L_{i_3}, L_{i_3+1})$ is in
contradiction with Definition 36. Therefore, $T$ is proper above A.
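
The argument above turns on finding three distinct ancestors of A that carry the same label. A minimal Python sketch of that counting step follows. It assumes the ancestors are given as a list of labels read from the root downward, and the function name and the threshold of three are taken only from how this proof uses the notion; the sketch does not restate Definition 36, which is not reproduced in this section.

```python
from collections import Counter
from typing import List, Optional


def repeated_ancestor_label(labels: List[str], limit: int = 3) -> Optional[str]:
    """Return the first label carried by at least `limit` ancestors, else None.

    `labels` is assumed to list the labels of a node's ancestors, root first.
    """
    counts = Counter(labels)
    for label in labels:
        if counts[label] >= limit:
            return label
    return None
```

A path on which some production labels three ancestors is reported, while a path on which every label occurs at most twice yields `None`.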

We now consider the latter of the two ways in which T may fail to be proper: 
that is, assume that T is not proper above B. 

We know, from the previous two Lemmas, that T is a simplified tree for fiBa. 
Let W and O^d^Q be the prefix-ancestral set and prefix-ancestral interstitial se- 
quence, respectively. 

We have assumed that there is some $i$ such that, within $V_i$, there are three nodes
$C_1$, $C_2$, and $C_3$ such that

$C_1 < C_2 < C_3$, and $C_1$, $C_2$, and $C_3$ all carry the same label.

Depending on whether the nodes in $V_i$ are ancestral to A, we define $k$,
$x$, and $I$ as follows:

• if the nodes in $V_i$ are ancestral to A, then let $k = 0$, $x = \epsilon$, and $I = J$;

• otherwise, the nodes in $V_i$ are not ancestral to A, in which case we let $k = 1$,
$x = a$, and $I = U$.

Either way, let / = (ri, tt, . . . , 

Note that Ci has no child labeled with a grammar symbol; thus it was created 
when Qi, was applied to an item of the form [Aj — » ■ai,x]; likewise for C2 and 
C3. The three nodes Ci, C2, and C3 thus correspond to three items r,^ , r,, , and r,, , 

respectively. When ti < qi < t^, we have that r,, > r,,+i, as required by Condition|4| 

k 

of Definition l29l yet we have that r,, = r,, = r,^ when qt - h, in violation of that 
Condition. 

Therefore, we have shown that $T$ is proper above B, because the argument of
the last paragraph applies to both $Q_U(U)$ and $Q_A(J)$. Since $T$ is also proper above A,
it is in fact a proper simplified tree. ■

4.3.2 The Interchangeability of Parse Trees and Parse Paths 

Theorem 9. There is a bijection between the set of all parse paths and the proper
simplified tree set for $\beta a$.

Proof. Let $G = (\Sigma, N, P, S, T, M)$ be a TLR grammar, let $\beta \in (\Sigma \cup N)^*$ and $B \in N$
be such that $\beta B$ is a viable prefix followed by $a$.

The bijection is $R$. We begin by proving that $R$ is surjective.

Let $T_S$ be a proper simplified tree for $\beta B$ followed by $a$. We let A and B be
the nodes corresponding to $a$ and $B$, respectively. Let S be the greatest node in
$T_S$ such that S is an ancestor of both B and A. Let $R_B$ be that child of S that is
autoancestral to B. Let $T_J$ be that tree formed by removing $R_B$ from $T_S$, along with
the other children of S that are greater than $R_B$. Let $T'_U$ be the subtree rooted at
$R_B$. Finally, letting $V_1, V_2, \ldots, V_{m_1}$ be those children of S greater than $R_B$, we let
$Y = (V_1, V_2, \ldots, V_{m_1})$.

Let Ty be that tree formed from T[j by removing all nodes if is the root 
of a subtree with 6-yield. 

Let Q-D be the right side of Jf{S). We define a function 2^' : (T* x (E U N)*) ^ 
Dnow. The value of (Yd, (Id) we give now, letting Yd = (Vd.i, Vd,2, • • • , Vdjij,). 

1 . If Vdjid has but a single node, then let L be a terminal link from q-d to a, with 
the value {ao, "d)- In this case, define (2d' (Yd, Q-d) = (L). 

2. If Vd,„o has multiple nodes, then we first let L' be a partial downward link 
from Q-D to a, with the value (q-d, "d)- Next, we label the children of the root 
of Vd.„p as follows (in order): Wi, W2, . . . , W^p, and we label the right side 
of ^(^(Vd,„ J) as a'. Let (Li , L^, . . . , L,,^) = e^'CCW, , W,, . . . , W,„„), a'). 
Define Q^HYd, ^d) = (L', Li , Lj, . . . , L,,^). 

We will pause and establish an intermediate result. Let Y € T* be such that, 
should we let Y = (Vi , V2, . . . , V,„, ), each of Vi , V2, . . . , V„j_i is a parse tree with 
e-yield and V„,, is a parse-proper tree whose yield is a; furthermore, 

.5^(.^(Vi))if (.^(V2)) . . . ^(^(V„„)) 

is a prefix of $\alpha$. We will show, by induction on the height of $V_{m_1}$, that

$Q_D(Q_D^{-1}(Y, \alpha)) = Y. \qquad (23)$

Let the height of V^, = hy. Assume that hy = 1. In this case, V,„, has only a 
single node, and that is labeled a. Thus, the value of 2d'(Y, a) is the terminal link 
(a, mi). We have, by RuleQof Definitionl56l that 2d(2d'('*''°') = where we 
are letting Y' = (Qi, Q2, . . . , Q,,,;). However, we note that |Q;„;| = mi, and that the 
first mi - I elements of Y' are parse trees for the derivations 

Jf(^(Vi))^6, 

^(^(V„,_,))^e, 

respectively. Finally, since V,„| is the one-node tree whose root is labeled a — again, 
by RulefTlof Definitionl56l— we have <23lif/;v = 1. 

Assume now that we have established (23) if $h_Y = k_Y$, where $k_Y \ge 1$; assume
also that $h_Y = k_Y + 1$. When considering the evaluation of $Q_D^{-1}$ on $Y$, we recursively
evaluate $Q_D^{-1}$ on $(Y_h, \alpha_h)$, such that the final element of $Y_h$ is of height $k_Y$. We can
therefore say, by the induction hypothesis, that $Q_D(Q_D^{-1}(Y_h, \alpha_h)) = Y_h$. We return to
the evaluation of Qq. Note that Lj is a partial downward link for a to a, the value 
of which we denote (o-i, mj). We denote a = X[X2 . . . Xiai. Let V'j , Vj, . . . , V'^^^ j be 
parse trees for the derivations 

$X_1 \Rightarrow^* \epsilon, \; \ldots, \; X_{m_1 - 1} \Rightarrow^* \epsilon,$

and let $V'_{m_1}$ be the tree formed by making each of the elements of $Y_h$ children of a
node labeled $X_{m_1} \to \gamma$. However, this is just $Y$, so we have (23).

Let $k \in \{0, 1\}$, and let $y \in \Sigma^*$ such that $|y| = k$. Let $T_P \subset T$ be the set of
all parse-proper trees whose leaves are either a grammar symbol or $\epsilon$. We define
$Q_k^{-1}$ on a tree $T_X \in T_P$ as follows.

1. Assuming that $T_X$ has no nodes, we let

$Q_k^{-1}(T_X) = ()$.

2. Assume that $T_X$ has multiple nodes, and assume also that the greatest leaf is
labeled by the production $A_X \to \alpha_X$. Let $T'_X$ be that tree formed from $T_X$ by
removing the latter's greatest leaf, and let $(i_1, i_2, \ldots, i_m) = Q_k^{-1}(T'_X)$. Finally,
let

$Q_k^{-1}(T_X) = (i_1, i_2, \ldots, i_m, [A_X \to \cdot\alpha_X, y])$.

3. Assume that $T_X$ has multiple nodes, and assume also that the greatest leaf $N_X$
is labeled by the grammar symbol $X_X$; let the parent of $N_X$ be labeled $E \to \gamma_X$.
Let $m_X - 1$ be the number of siblings of $N_X$, and let $T''_X$ be that tree formed
from $T_X$ by removing $N_X$. Let

$(j_1, j_2, \ldots, j_p) = Q_k^{-1}(T''_X)$.

Let $\gamma_X = Z_1 Z_2 \cdots Z_{n_X}$, and let

$i = [E \to Z_1 Z_2 \cdots Z_{m_X} \cdot Z_{m_X + 1} \cdots Z_{n_X}, y]$.

Finally, let

$Q_k^{-1}(T_X) = (j_1, j_2, \ldots, j_p, i)$.
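
The three cases above amount to repeatedly removing the greatest leaf and recording one item per removal. The following Python sketch illustrates that reading under stated assumptions: the `Node` class and the tuple encodings of productions and items are hypothetical, and "greatest leaf" is taken to mean the leaf reached by always following the last child, which this section does not confirm. It illustrates the recursion only and is not the author's implementation.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Production = Tuple[str, Tuple[str, ...]]      # (head, right side)
Item = Tuple[str, Tuple[str, ...], int, str]  # (head, right side, dot, lookahead)


@dataclass
class Node:
    """A production node (tuple label) or a grammar-symbol leaf (str label)."""
    label: Union[str, Production]
    children: List["Node"] = field(default_factory=list)


def inverse_items(tree: Optional[Node], lookahead: str) -> List[Item]:
    """Sketch of the inverse map: strip the greatest leaf, one item per step."""
    root = copy.deepcopy(tree)  # leave the caller's tree untouched
    items: List[Item] = []
    while root is not None:     # case 1: an empty tree contributes nothing
        parent, leaf = None, root
        while leaf.children:    # assumed: the greatest leaf is the rightmost
            parent, leaf = leaf, leaf.children[-1]
        if isinstance(leaf.label, tuple):
            # case 2: a production leaf A -> alpha gives [A -> .alpha, y]
            head, rhs = leaf.label
            items.append((head, rhs, 0, lookahead))
        else:
            # case 3: a symbol leaf with m - 1 siblings puts the dot after Z_m
            if parent is None:
                raise ValueError("a grammar-symbol leaf needs a parent")
            head, rhs = parent.label
            items.append((head, rhs, len(parent.children), lookahead))
        if parent is None:
            root = None
        else:
            parent.children.pop()
    items.reverse()             # items were collected last-to-first
    return items
```

Because each step appends the item for the current greatest leaf and then recurses on the smaller tree, the list is built in reverse and flipped once at the end.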

We establish an intermediate result. Let $k \in \{0, 1\}$, and let $z \in \Sigma^*$ such that
$|z| = k$. Let $T_y$ be a parse-proper tree whose root is a production node. We
wish to show that

$Q_k(Q_k^{-1}(T_y)) = T_y. \qquad (24)$

Letting the number of nodes in $T_y$ be $n_y$, we proceed by induction on $n_y$. Assume,
for our basis step, that $n_y = 1$. In this case, the only node is labeled $A_y \to \alpha_y$. The
application of $Q_k^{-1}$ yields the sequence

$([A_y \to \cdot\alpha_y, z])$;

this is the input of $Q_k$. The application of $Q_k$, as given by Definition 54,
will create a one-node tree whose node is labeled $A_y \to \alpha_y$. This is just $T_y$.

Assume that we know that (24) holds for $k_y$-node trees, where $k_y \ge 1$. Assume
also that $n_y = k_y + 1$. We consider the greatest leaf of $T_y$, a leaf that we label
$F_y$. This leaf may be either a grammar symbol or a production node; we consider
these cases separately.

1. Assume that $F_y$ is labeled $C_y \to \gamma_y$. Let $T'_y$ be that tree formed from $T_y$ by
removing $F_y$; in order to evaluate $Q_k^{-1}$ on $T_y$, we first evaluate $Q_k^{-1}$ on $T'_y$.
By the induction hypothesis, $Q_k(Q_k^{-1}(T'_y)) = T'_y$. The value of $Q_k^{-1}(T_y)$ will
be the item $[C_y \to \cdot\gamma_y, z]$ appended to the sequence $Q_k^{-1}(T'_y)$. Consider the
evaluation of $Q_k$ when we evaluate the expression $Q_k(Q_k^{-1}(T_y))$: we first
recursively evaluate $Q_k$ with the sequence of items $Q_k^{-1}(T'_y)$, which will yield
$T'_y$; the evaluation of $Q_k$ will be completed by adding a node, corresponding
to the item $[C_y \to \cdot\gamma_y, z]$, to $T'_y$; this node will be labeled $C_y \to \gamma_y$, hence,

$Q_k(Q_k^{-1}(T_y)) = T_y$.

2. Assume that $F_y$ is labeled $X$, for some $X \in \Sigma \cup N$. This case is similar to the
first. We write the label of the parent of $F_y$ as $D_y \to \delta_y X \zeta_y$, such that
$F_y$ has $|\delta_y|$ siblings. Let $T''_y$ be that tree formed from $T_y$ by removing $F_y$. As
before, when evaluating $Q_k(Q_k^{-1}(T_y))$, we form the final tree from $T''_y$ and the
item $[D_y \to \delta_y X \cdot \zeta_y, z]$; this yields $T_y$.

We can appeal to this result directly to conclude that there is a function $Q_A^{-1}$
such that, for any parse-proper tree $T_A$, we have that

$Q_A(Q_A^{-1}(T_A)) = T_A. \qquad (25)$

Similarly, there is a function $Q_U^{-1}$ such that

$Q_U(Q_U^{-1}(T_U)) = T_U$.

We now return to the trees we considered in the beginning of this proof. Let
$D = Q_D^{-1}(Y)$, and let $J = Q_A^{-1}(T_J)$. Let $\pi$ be the label of S, recalling that S is a
node in the tree $T_S$. Let the index of the child of S that is autoancestral to B be
$i_B$; we write $\pi = F \to \eta\theta$, such that $|\eta| = i_B$; thus we let $s = [F \to \eta \cdot \theta]$. Finally,
we turn to $U$: if $T_U$ has a single node, then we let $U = ()$; otherwise, $T_U$ has
multiple nodes, in which case we let $U = Q_U^{-1}(T_U)$.

Let $V = R(U, J, s, D)$. If we can show that $V = T_S$, then we shall have shown
that $R$ is surjective. We first note that $T_J = Q_A(J)$ and $Y = Q_D(D)$. Now there are
two cases to consider.

1. Assume that $T_U$ has but a single node. In this case, we attach a single node
labeled $B$ as the greatest child to $T_J$, as in Definition 59; but since
$T_U$ has but a single node, this node must be labeled $B$.

2. Assume that $T_U$ has multiple nodes. If this is the case, then $T_U = Q_U(U)$. We
note that the tree $T_{U,0}$, as specified in Definition 59, is in fact equal to $T'_U$, as
specified in this proof. In that Definition, we construct $R(U, J, s, D)$ according
to Rule 5 by attaching $T'_U$ to $T_J$ at the place that it originally resided.

In either case, we finish by attaching the nodes from $Y$ following the root of $T'_U$.
Clearly, this yields $T_S$.

Assume now, for the sake of a contradiction, that R is not injective. That is: 
assume that there are two distinct parse-paths $(U_1, J_1, s_1, D_1)$ and $(U_2, J_2, s_2, D_2)$
such that

$R(U_1, J_1, s_1, D_1) = R(U_2, J_2, s_2, D_2). \qquad (26)$

There are several ways in which these parse paths could differ. We examine 
each of these ways, eliminating each in turn. 

Let $P_k$ and $R_k$ be two $k$-parse paths, such that $Q_k(P_k) = Q_k(R_k)$; let $w \in \Sigma^*$
such that $|w| = k$. We will show that $P_k = R_k$. Assume, for the sake of a
contradiction, that $P_k \ne R_k$. Since the number of nodes in $Q_k(P_k)$ is $|P_k|$, and
since $Q_k(P_k) = Q_k(R_k)$, we have that $|P_k| = |R_k|$. Let $P_k = (p_1, p_2, \ldots, p_{n_k})$
and let $R_k = (r_1, r_2, \ldots, r_{n_k})$. Let $i_z$ be the least index such that $p_{i_z} \ne r_{i_z}$; let
$p_h = [A_{P,h} \to \alpha_{P,h} \cdot \gamma_{P,h}, w]$ and let $r_h = [A_{R,h} \to \alpha_{R,h} \cdot \gamma_{R,h}, w]$, for $1 \le h \le n_k$.
There are 2 cases to consider.

1. Assume that $i_z = 1$; thus, $\alpha_{P,1} = \alpha_{R,1} = \epsilon$. Since $Q_k$ yields identical
one-node trees when evaluated on the sequences $(p_1)$ and $(r_1)$, we must have that
$A_{P,1} \to \gamma_{P,1} = A_{R,1} \to \gamma_{R,1}$.

2. Assume that $i_z > 1$. As $p_{i_z - 1} = r_{i_z - 1}$, we have that

$\alpha_{P,i_z-1} = \alpha_{R,i_z-1}$, and
$\gamma_{P,i_z-1} = \gamma_{R,i_z-1}$.

We have either that:

(a) if $\alpha_{P,i_z} = \epsilon$, then the node corresponding to $p_{i_z}$ is the first child of a
node corresponding to $p_{i_z-1}$; let $N_P$ in $Q_k(P_k)$ and $N_R$ in $Q_k(R_k)$ be the
nodes corresponding to $p_{i_z-1}$ and $r_{i_z-1}$, respectively; the first children of
$N_P$ and $N_R$, respectively, are labeled

$A_1 \to \gamma_1$ and $A_2 \to \gamma_2$,

such that these two productions are identical; thus, $p_{i_z} = r_{i_z}$;

(b) if $\alpha_{P,i_z} \ne \epsilon$, then

$A_{P,i_z-1} \to \alpha_{P,i_z-1}\gamma_{P,i_z-1} = A_{P,i_z} \to \alpha_{P,i_z}\gamma_{P,i_z}$

such that

$|\alpha_{P,i_z-1}| + 1 = |\alpha_{P,i_z}|$;

note that the node corresponding to $p_{i_z-1}$ is identical to the node
corresponding to $r_{i_z-1}$, and that the least sibling of $p_{i_z-1}$ that is greater than
$p_{i_z-1}$ corresponds to $p_{i_z}$, and is labeled by the grammar symbol $Y$, such
that

$\alpha_{P,i_z-1} Y = \alpha_{P,i_z}; \qquad (27)$

the node corresponding to $r_{i_z-1}$ must have a sibling identical to the one
corresponding to $p_{i_z}$; this sibling corresponds to $r_{i_z}$, so we conclude
from (27) that $p_{i_z} = r_{i_z}$.

In all of these cases, we see that we have $P_k = R_k$.

We have established a result that allows us to claim that $U_1 = U_2$ and that
$J_1 = J_2$.

Assume that $D_1 \ne D_2$; we clearly have that $Q_D(D_1) = Q_D(D_2)$, and from this we
will derive a contradiction. Let $H$ be the height of the last element of $Q_D(D_1)$; thus

$|D_1| = H = |D_2|$.

Let $D_1 = (L_1, L_2, \ldots, L_H)$ and let $D_2 = (L'_1, L'_2, \ldots, L'_H)$, and let $k_d$ be the least
index such that $L_{k_d} \ne L'_{k_d}$. There are two cases.

1. Assume that = 1. Let 2d(^i) = (Yl.i, Yl.2' • • ■ > Yuny)', we have that 

n = ^C^(YLa))^(^(YL,2)) . . .^(i^(YL,n^)). 

Since Li = (jl, ieD(£>i)l) and let L\ = (jl, ieD(£»2)l), and since \Qd{Di)\ = 
\Qd(D2)\, we conclude that Lj = L'j. 

2. Assume that k^ > I. Let Tj and be the final elements of Qo{Di) and 
QoiDi), respectively. In this case, let be the node in Tj such that Nj is 
the greatest node with exactly \Di | - k^ ancestors; let Nj be similarly defined 
for Tj. The parents of the nodes Nd and Nj, corresponding as they do to the 
identical links L^j-i and are identical, and identically labeled with the 
production Al ttL- Both L^^ and L'j.^ are links, either partial-downward or 
terminal, for aj^. Let the number of siblings of Nj be n^, which is also the 
number of siblings of N^. Clearly, 

= (ffL,«d), and 

Thus, conclude that Lj-^ = L'j^^. 

As we have a contradiction either way, we have therefore that Di = D2. 

The only other possibility is that $s_1 \ne s_2$. Let $P_s$ be the attach-point of $Q_A(J_1)$,
and let $P_s$ be labeled $A_s \to \alpha_s X_s \gamma_s$ such that $P_s$ has $|\alpha_s|$ children. The child-string
of $P_s$ is clearly $\alpha_s$, and $L_1$ (the first element of $D_1$ or $D_2$) is a link for $\gamma_s$. Therefore,

$s_1 = [A_s \to \alpha_s X_s \cdot \gamma_s] = s_2$.

Since $(U_1, J_1, s_1, D_1) = (U_2, J_2, s_2, D_2)$, we have that $R$ is injective, and
therefore, a bijection. ■

Theorem 10. Let G = (L, N,P,S ) be an LR( 1 ) grammar, let [}B be a viable prefix 
followed by a, for 5 e N, and let V be a parse-path for fiB. If n is a production, 
then 

Lsifin) = M{R(V),n). 

Proof LetP = (U,J,s,D). 

Let U = (ui,U2, ■ ■ ■ , u„), let J = (j\,j2,---, jm), let s = [E ^ rj ■ 6], and let 
D = (Li,L2, . . . ,Lp). For 1 < ;„ < n, let = [A,„ — » •y;^,a]; likewise, 
for 1 < in, < m, let jj^ = [C,„, — » ■ finally, for I < ip < p, we let 
L,p = (0,p,fc,p). Let T be the proper simplified tree for /3B followed by a. Let 
Tu = Qvm, let Tj = Qj,{J), and let Y = Qo(D). 

Let $\pi = A \to \alpha$ be a production, and let $L_{V,P}(\pi) = n_V$. We will show that

$L_{V,P}(\pi) \le M(T, \pi). \qquad (28)$

If $n_V = -1$, then $M(T, \pi) \ge n_V$ trivially. So assume that $n_V \ne -1$. There
are six possibilities.

1. Assume that Vjj(n) = n\j. In this case, there is some ;„ such that 

A,„ — > Qr,„7,„ = n. 

Now 

"'u = ['4.„ • 7.,,,"]; 

let = - lor, J; evidently, 

which is to say that = e. When we evaluate Qu on the item , a node 
labeled n gets added to 2u(("i, "2, • • • , ";„-i)); let this node be Nu. Since N„ 
is an ancestor of B, but not A, then by Definition l45l we have that M(T, tt) = 

l7/ul + l- 
2. Assume that Vn^^in) = ny. This means that n is used in the derivation 

7i, => f , 

for some \ < < n. There must, according to Definition |59| be some node 
labeled n in on of the parse trees attached to to form Tu o> as specified 
in the same Definition. Let Ny o be this node. Since o shares an ancestor 
(corresponding to the attach-point of Tj) with both B and A, we have that 
M{J,Ti) = |a| + 1. 

3. Assume that Vo(^) = "v- This means that there is some \ < ki < m such 
that Ctj — » 5tj<r*j = We have in this case that 

14,1 + 1 =nv. (29) 

As every production node of Tj is the greatest of its siblings, we see that 
only the greatest child of a production node can be autoancestral to B. Let 
k' = ki - |(5j.J; When we evaluate 2a on ;V, we add a node labeled n to 
Tj — let this node be Nj. There will be at least |(5tj| children of Nj. Thus, by 
DefinitionliSl 

//t,(Nj) = |(5,J+1, 

and so M(T, n) = l^^tj | + 1 . 

4. Assume that V<j/(7r) = ny This means that there is some I < h\i < p such 
that 

P(Lh^,Li,^+,) = n, 

where 

'^Av+i = "V- (30) 
When evaluating 2d on L;,^, we create a sequence of trees: the final tree 
Td in this evaluation will have a root labeled n, such that this root has ki,y+i 
children; by Definition l45l the function 

We have, by that 

M(Tj),n) = nv 

5. Assume that V'p,e(;r) = ny This means that there is some \ < h'y < p, where, 
writing 9;,^ = X1X2 . . . Xf^^, we have that n is used in the derivation 

X/o^f, (31) 

for some 1 < Id < No- When, according to Definition |5^ we evaluate 2d 
on the downward link L;,^, we create a parse tree for derivation <31> : in this 
parse tree, there will be a node P/j^ which is labeled n. This node Po ^ shares 
an ancestor with both B and A; moreover, B < Pd.je < A. Therefore, we have 

Hj(7i) = Iq-I + 1 = nv 

6. Assume that Vn(n) = «v In this case, we first note that n = E rjO. When 
we join the two trees Tj and Tu to the trees in Y, we find that the node that 
was the attach-point of Tu now has I77I - 1 children inherited from Tj , one 
child corresponding to either the sidelink or the root of Tu, and one child for
each of the elements of Y. These children are added in the order listed: note
that the final element of Y corresponds to a node that is autoancestral to A. Let



i's = m - \r]\ - I; 

since \tj\ > 1, we know that 

= [£ ^ -Tjei 

Let Ps be the node corresponding to this item (added during the evaluation 
of Qa); we have already established that Ps is an ancestor of A with I77I + |Y| 
children, the last of which is autoancestral to A. Therefore, according to 
Case|6|of Definition|45| we have that 

//t(Ps) = MV. 

Therefore, we have established )28> . 

Let = C — * 7 be some production, and let M(T, 0) = iij. We dispense with 
the possibility that wt = -1, as it is trivial; thus, assume that nj > -1. There must 
be some node P^ in T such that Jf{P^) = c/>, such that, 

Ht(P^) = nj. 

There are three cases to consider for the relationship of with the nodes B and A 
(with some subcases). 

1. Assume that P^ is an ancestor of B, while P^ is not an ancestor of A. By 
the construction of Tu,o in Definition l39l we can see that P^ has I7I children 
and so, the value of Hj{P^) is \y\ + 1. There must be some item ;<b in U 
corresponding to P^; this item will be of the form 

;<B = [C - J, a]. 

By Definition l38l Condition^ we can see that Vn(<^) = I7I + 1- 

2. Assume that P^ is an ancestor of A. In this case, we have that P^ corre- 
sponds to some element of (U,J,s,D), according to one of the following 
possibilities. 

(a) There could be some node in one of the elements of Y corresponding 
to P^. This node, which we call Py, is in the last element of Y, a 
tree which we label Ty. Let the number of ancestors of Py in Ty be 
/la. When evaluating Qa on D, we add the node Py — corresponding to 
Py — during the evaluation of Qo on This node P^ will be the 

root of a subtree of Ty ; a subtree which was created when evaluating 
2d on Since P^ has kk.^+t children, 

"T = h./, 

thus, since /"(L/i^+i, L/,,,+2) = <P, we have therefore that 
Vt(D) = h^+i = ut. 
(b) Assume that the child of $P_\phi$ which is autoancestral to A does not
correspond to any node in $T_J$. Let $P_A$ be the node in $T_J$ corresponding to
$P_\phi$; note that $P_A$ has $n_T - 1$ children in $T_J$. We added $P_A$ to $T_J$ when
evaluating Qa on the item j„i_„^+i. Let j„,-„j+i be of the form 

[D, --Tj/Zj]; 

evidently, j„, is of the form 

[Oj ^ i, ■ ml 

We know that Izyj] > 1, and so we write rji = Xjrj^; we must have that 

s = [D, ^ (,X, ■ ri,] = [E IT e]. 

Now, P(j will have one child corresponding to each of the children of 
Pa; additionally, it will have one P^, such that P;. is either labeled B, or 
Pc is the node corresponding to the root of QuiU); finally, it will have 
one child for each of the elements of Y. Since |Y| = fci, we have that 

"T = \ri\ + 

thus, by Definition l38l we conclude that 

VnW = ni- 
ce) Assume that we do not have Case l2bl and that there is some item in J 
corresponding to P^. This item is of the form 

y-A = [C ^ -yl 
As P(j has - 1 children, we must have that 

where 

\6j\ = m- \ 

such that (5j^j = y. Therefore, according to Definition l38l 
V4,W = \Si\ + 1 = nx. 
In all three of these cases, we find that 

Lv(0)>M(T,0). 

3. Finally, assume that P^ shares an ancestor with both B and A, but is an 
ancestor of neither, yet B < P < A. There are two ways that this can happen. 

(a) Assume that P^ shares an ancestor with B but not A, such that B < 
P < A. Among all such ancestors, there is one, which we label Qb, 
such that Qb is the root of a subtree that shares no nodes with Q\]{U), 
save Let Tu o be an in Definition|59| Let the node in Ty o corresponding 
to P^ be Pu.o- Let Rb be that child of Qb that is autoancestral to B; 
by the construction of Tuo, we can see that the subtree rooted at Rb 
corresponds to one of the derivations 

for some 1 <kQ <n. Therefore, 

Vn..W = l7l + l. 
(b) Assume that $P_\phi$ shares no ancestor with B that is not also an ancestor of
A. Let $Y = (V_1, V_2, \ldots, V_{k_1})$. Let $P_\phi$ correspond to a node $P_{Y,\epsilon}$ in any
of the trees in $Y$. If $P_{Y,\epsilon}$ is in any of $V_1, V_2, \ldots, V_{k_1 - 1}$, say, $V_i$, then
this node $P_{Y,\epsilon}$ is the root of a subtree with yield $\epsilon$ in the tree $V_i$. If $P_\phi$
corresponds to a node in $V_{k_1}$, since the yield of $V_{k_1}$ is $a$, and $P_{Y,\epsilon}$ is not
an ancestor of A, we have that $P_{Y,\epsilon}$ is again the root of a subtree with
yield $\epsilon$. Therefore,

Vr.W = lrl + i. 

In all three of these cases, we have established that

$L_{V,P}(\phi) \ge M(T, \phi). \qquad (32)$

Therefore, by (28) and (32), we have that

$L_{V,P}(\phi) = M(T, \phi)$. ■

Thus, we may say that $P$ and $T_P$ are "isomorphic" under $R$, with respect to
conservation of productions.

5 Related Work 

Going back to the first days of high-level computer languages, the general idea 
of a computer language whose parser could modify itself — a construction called 
an "extensible language" — was considered and tried numerous times. However, 
these efforts were not always met with success. Perhaps it was because compiler 
construction as a discipline itself was not well understood, or that the appropriate 
formal language theory had not been developed, or that the extensible compilers 
were not powerful enough: for whatever reason, extensible languages have largely 
fallen by the wayside. 

One of the first serious extensible language projects was a variant of Algol 60
called IMP [18]. Along with IMP, another well regarded extensible language was
ECL [34]. These languages allowed programs to (in modern parlance) specify
new productions for the language, and supply a replacement template, much in
the manner of a macro definition. The parsers for such languages were apparently
complex, ad hoc, and arcane affairs; a programmer wishing to extend such a
beast needed to understand a fair bit of the internals of the parser to extend it and
understand the cause of problems.

An often expressed goal for an extensible language would be to allow a pro- 
gram to supply a new data type, along with associated infix operations, so that 
programs dealing with matrices or complex numbers could use the natural syntax.
The modern approach to this problem is to use operator overloading in an
object-oriented language. This raises the question: why would a programmer want
to engage in the difficult endeavour of modifying the parser when a mechanism
like operator overloading suffices?
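
To make the comparison concrete, here is a minimal sketch of the operator-overloading approach in Python. The `Matrix` class is a hypothetical example written for this illustration and is not drawn from any of the systems discussed; it only shows that infix `+` and `*` can be given a meaning for a new data type without touching the parser.

```python
class Matrix:
    """A small dense matrix supporting infix + and * via operator overloading."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __add__(self, other):
        # Element-wise sum; both operands are assumed to have the same shape.
        return Matrix([[a + b for a, b in zip(r, s)]
                       for r, s in zip(self.rows, other.rows)])

    def __mul__(self, other):
        # Matrix product: each entry is a row of self times a column of other.
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows
```

With this class, an expression such as `a * b + c` on matrices parses with the language's existing precedence rules; what overloading cannot do is introduce a new operator symbol or a new precedence level.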

The high point of interest in extensible languages was likely the International
Symposium on Extensible Languages. The Proceedings of this Symposium [30]
contain several reports on real-world extensible languages: there are many reports
on languages like ECL (e.g. [34], [6], and [28]); some more general works on
macro systems (e.g. [27] and [15]); and some survey works (e.g. [14] and [12]).
The mood was upbeat, but a little over-optimistic; indeed, as Cheatham put it,
extensible languages had delivered: "there exist languages and host systems which
fulfill the goals of extensibility" [12]. But not everyone was upbeat: there were
reports of failures, not of implementation, but of extensibility itself [32].

The extensible language concept did not disappear, even after the appearance
of languages like C++. One example is [9], a language that turned out to be only
partially successful, complex to use, and crippled by performance problems.

Some of these systems use a self-modifying compiler, and operate in a single
pass, while others use a two-pass compiler. Almost all of them do a textual
substitution, at least on the conceptual level. We would therefore consider the study of
these systems to be a study of macro systems: if a macro system admits patterns
that are more complex than a function call (in contrast to, e.g., the CPP macro
system for C and C++, which requires macros to be of the form
MACRO_NAME(PARAM1, PARAM2, ...)), then we may say that the macro system is a
syntactic macro system; otherwise, we say that the macro system is a simple macro
system. The study of syntactic macros has continued in its own right, and syntax
macros are present in some modern languages, including Scheme [25].

An interesting summary of the issues related to advanced macro systems is due
to Brabrand and Schwartzbach [5], who summarize prominent macro systems, and
present a new one of their own. The macro system presented in [5] operates on
partial parse-trees, which illustrates the necessity of a macro-aware parser.

The aforementioned syntactic macro systems can use macros in very powerful
ways, achieving many of the goals of the early extensible languages. It is certainly
possible to implement a language with syntax macros using what we have in the
present work termed a transformative parser, which would allow for an
implicit macro call. This has been done in [10], where a syntax macro programming
language implemented using a transformative LL parser is described. The
latter system is powerful enough to extend a functional language into an imperative
language (like C), and it avoids the problems associated with many macro
systems.

Most likely due to their syntactic simplicity (even austerity), syntax-macros 
are usually reserved for functional programming languages. That is not to say that 
they cannot be used for a syntactically-rich language like C. Exactly this was done 
for C by Weise and Crew 1 35 1; one point of note is that, rather than supply a static 
template with which to replace the macro invocation, their system allows for the 
replacement to be generated by running procedural code on the macro parame- 
ters; the replacement will be an abstract syntax tree. Allowing the macro body to 
include code, which will be run (most likely by an embedded interpreter) during 
compilation, could be called compile-time computation. Other macro systems 
allow this — Scheme most notably. Another system which allows for compile-time 
computation is C++: it has been discovered that the C++ template system is
Turing complete [33].
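
As a rough illustration of a macro whose replacement is computed by procedural
code rather than copied from a static template, the following Python sketch
builds an abstract syntax tree with the standard ast module and then compiles
it. This is not the system of [35]: Python merely stands in for a compile-time
language, and the names power_macro and expand_and_run are hypothetical.

```python
import ast


def power_macro(n):
    """Hypothetical macro body: build the AST for x * x * ... * x (n factors)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    expr = ast.Name(id="x", ctx=ast.Load())
    for _ in range(n - 1):
        # Each iteration wraps the expression so far in one more multiplication.
        expr = ast.BinOp(
            left=expr,
            op=ast.Mult(),
            right=ast.Name(id="x", ctx=ast.Load()),
        )
    return expr


def expand_and_run(n, x):
    """Expand the macro, compile the resulting tree, and evaluate it for x."""
    tree = ast.fix_missing_locations(ast.Expression(body=power_macro(n)))
    code = compile(tree, "<macro>", "eval")
    return eval(code, {"x": x})
```

The loop in power_macro runs before the generated expression is ever evaluated,
which is the sense in which the macro body performs compile-time computation;
the replacement it returns is a tree, not text.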

The choice to allow compile-time computation in the macro body has significant
advantages (see [20] for a survey of partial evaluation), but there are many
drawbacks. The biggest drawback is the increased complexity of having two lan- 
guages side by side: the run-time language and the compile-time language. 

The present work is part of an effort to make a real-world programming lan- 
guage using a transformative LR(1) parser. This programming language would, 
it is hoped, prove useful to developers of domain-specific embedded languages
(DSELs). The subject of DSELs is of much interest today: for example, see the
recent survey piece by Mernik, Heering, and Sloane [24]. In order to develop DSELs,
it is usually necessary to make modifications to the compiler's source code, a task
hopefully made easier by using a programming language with a transformative
LR(1) parser at its core.

Another task which requires the modification of a compiler's source code is the 
extension of a programming language to add new features: for example, adding 
aspect-oriented capabilities to Java f^. This is done often enough (especially with 
Java in recent years) that software systems dedicated to this task have appeared 
[26]. Indeed, this was one of the original motivations for early extensible languages
[29]. However, for the Java systems mentioned above, the compiler is extended
prior to compilation, so we may term this an offline grammar transformation. 
Some other modern systems do allow for transformation during parsing, but a
special macro invocation must be used to tell the parser to launch a subparser;
i.e., the parse is not self-modifying. This can be done with quoting, as in [22].

The formal basis for the TLR parsing algorithm is the theory of LR(k)
languages. This theory was introduced by Knuth [21], whose original paper is
insightful; another reference for the theory of LR(k) parsing is [2]. Coming at
LR(k) parsing from the more practical side is the classic "Dragon Book" [1] (for
k = 1); this last work
is particularly recommended. We base our transformative language on the LR(1) 
languages because the parsing algorithm for this class of languages is well-suited 
for a transformative language parser: since a substring of a sentence can define 
the syntax for some substring immediately to its right, we evidently want to scan 
sentences from left to right; a backtracking algorithm is undesirable because it is 
complex and expensive to backtrack past a point at which a grammar transforma- 
tion occurred; finally, bounded lookahead is important because until a decision is 
made as to whether or not a grammar transformation will take place at a certain 
point, it is unknown which (context-free) grammar has the lookahead under its 
purview. Also, in practice, LR(1) parsers are designed to execute code fragments 
after reduction by certain productions: this is a natural place to insert the grammar 
transformation algorithm — as indeed, we have done in the present work. 
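
The idea of attaching a grammar transformation to a reduction can be sketched
in a few lines of Python. This is only a toy under loud assumptions: it is a
greedy shift-reduce recognizer with no lookahead and no parsing tables, it is
not the TLR algorithm, and it performs no validity test on the transformation.
The toy grammar, the wildcard terminal ANY, and all function names are
hypothetical.

```python
NONTERMINALS = {"Prog", "Item"}


def symbol_matches(pattern, symbol):
    # "ANY" is a hypothetical wildcard that matches any single terminal.
    if pattern == "ANY":
        return symbol not in NONTERMINALS
    return pattern == symbol


def add_word_production(grammar, matched):
    # Reduction action for Item -> def ANY ; which transforms the grammar by
    # adding the production Item -> <word> ; for the word just defined.
    word = matched[1]
    grammar.append(("Item", (word, ";"), None))


def initial_grammar():
    # Each production is (left-hand side, right-hand side, optional action).
    return [
        ("Prog", ("Prog", "Item"), None),
        ("Prog", ("Item",), None),
        ("Item", ("def", "ANY", ";"), add_word_production),
    ]


def try_reduce(grammar, stack):
    # Greedy choice: try productions with longer right-hand sides first.
    for lhs, rhs, action in sorted(grammar, key=lambda p: -len(p[1])):
        n = len(rhs)
        if len(stack) >= n and all(
            symbol_matches(p, s) for p, s in zip(rhs, stack[-n:])
        ):
            matched = stack[-n:]
            del stack[-n:]
            stack.append(lhs)
            if action is not None:
                action(grammar, matched)  # the grammar may change mid-parse
            return True
    return False


def recognize_transformative(tokens):
    grammar = initial_grammar()
    stack = []
    for token in tokens:
        stack.append(token)  # shift
        while try_reduce(grammar, stack):
            pass  # keep reducing while some production matches
    return stack == ["Prog"]
```

With this sketch, the input "def foo ; foo ;" is accepted because reducing the
definition adds a production for foo before the use is scanned, whereas
"foo ;" alone is rejected. A substring thus defines the syntax of what lies to
its right, which is the left-to-right behaviour described above.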

It is not surprising that a parser for a transformative language based on the
LR(1) languages has been presented before. Burshteyn [7] formalizes an idea of
modifiable grammars, roughly equivalent to transformative grammars. Much of the
present work is concerned with allowable transformations: in Section 2.4.1 we
saw the negative consequences of admitting completely arbitrary transformations.
In [7], the language generated by a modifiable grammar is equivalent to the naive
language considered in Section 2.4.1, so it does not avoid those pitfalls.

We do note that, if we only add or remove a few productions to or from a gram- 
mar, the LR(1) parsing tables for that grammar do not change "too much." The al- 
gorithm we present in the present work requires the parsing table to be completely 
regenerated upon acceptance of a grammar transformation. This need not be the 
case: indeed, in [7], the parser is modified only inasmuch as the grammar
transformation (our terminology) requires it; the method of incremental LR(1)
parser generation is originally due to Heering, Klint, and Rekers [16].

The canonical LR(1) parser, created by way of the method of Knuth [21],
is often eschewed for the LALR(1) parser, owing to the fact that the LALR(1)
parser (if it even exists) has much smaller parsing tables — however, Spector if sTl 
showed how to construct a different LR(1) parser that is often similar in size to the 
LALR(1) parser. It would probably be a profitable exercise to investigate the use
of these techniques in the TLR parsing algorithm for a real-world system. 

The conventional view is that programming languages are not context-free: 
that is, a program which uses an undeclared identifier is considered to be
syntactically well-formed (and hence in the context-free language generated by the
grammar) but semantically meaningless.' In order to catch this semantic gaffe, the 
parser performs a separate (at least in principle) semantic analysis phase, which 
is usually performed by procedural code; an alternative, declarative approach to 
syntactic and semantic analysis is to instruct the parser to add a new production 
for an identifier when it is declared, later to be removed when that identifier goes 
out of scope. A grammar which the parser modifies in this way, removing the need
to perform a separate semantic analysis, will be herein referred to as an adaptable
grammar. Two notable attempts at adaptable grammar systems are [8] and [4]; the
field is surveyed in [13]. There has been renewed interest in adaptable grammars:
see [11] and [19].
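
The declarative approach described above can be sketched as follows. This is a
minimal Python illustration, not any of the cited systems: the toy token
language (var, use, and braces for scopes) and the names AdaptableGrammar and
recognize_scoped are assumptions made for the example. The only mutable part
of the grammar is the set of productions Ident -> name.

```python
class AdaptableGrammar:
    """Toy grammar whose identifier productions are added and removed."""

    def __init__(self):
        # One set of productions Ident -> name per open scope.
        self.scopes = [set()]

    def open_scope(self):
        self.scopes.append(set())

    def close_scope(self):
        # Leaving a scope removes the productions that were added inside it.
        self.scopes.pop()

    def declare(self, name):
        # Corresponds to adding the production Ident -> name.
        self.scopes[-1].add(name)

    def derives_ident(self, name):
        # True when some production Ident -> name is currently in the grammar.
        return any(name in frame for frame in self.scopes)


def recognize_scoped(tokens):
    """Return True if the token list belongs to the toy adaptable language."""
    grammar = AdaptableGrammar()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "{":
            grammar.open_scope()
            i += 1
        elif tok == "}":
            if len(grammar.scopes) == 1:
                return False  # unmatched closing brace
            grammar.close_scope()
            i += 1
        elif tok in ("var", "use"):
            if i + 1 >= len(tokens):
                return False  # keyword without a following name
            name = tokens[i + 1]
            if tok == "var":
                grammar.declare(name)
            elif not grammar.derives_ident(name):
                return False  # rejected as a syntax error, not a semantic one
            i += 2
        else:
            return False
    return len(grammar.scopes) == 1
```

Under these assumptions, "{ var x use x }" is accepted while "{ var x } use x"
is rejected: by the time the use is scanned, the production for x has already
been removed, so no separate semantic pass is needed to catch the error.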

We have not been terribly clear about the differences between syntax and
semantics: if we (rightly) assume that syntax is what the parser does, then
processing of identifiers and scopes is evidently syntax, assuming a powerful
enough parser. It is in fact very difficult to delineate syntax and semantics [23].

It should be possible to make many types of systems on top of a TLR parser. In
this section, we have discussed extensible languages, syntax macros, and
adaptable grammars. The techniques in the present work should be general enough
to be used to build any of these three kinds of system. Systems with these as
their goals have not fared well: it is to be hoped that the problem in the past
was a lack of understanding of the fundamental parsing issues, which will
hopefully be remedied by TLR techniques.

Conclusion 

There has been much interest over the years in languages with features — like syn- 
tax macros, extensibility, and adaptable grammars — that are incompatible with a 
parser generated from a static grammar. Numerous ad hoc efforts have been made
to make systems with these features and a parser whose grammar is not fixed, with 
little success. One problem with these earlier attempts is a lack of understanding 
of the consequences of changing the grammar — a deficiency that this work hopes 
to address. 

Another explanation may be that the features are not audacious enough: why 
would someone want to trouble themselves with the arcana of the parser to achieve 
something adequately accomplished by operator overloading, templates, or seman- 
tic analysis? If, however, the features are compelling enough, then programmers 
might be willing to write grammar transformations. We attempted, in the present 
work, to take some of the mystery out of parsing a transformative language. 

Naive transformative languages, while straightforward and occasionally the 
subject of study, do not allow us to make guarantees about halting. Requiring valid 
transformations does allow us to make guarantees about halting, although the test
for validity is complex. We observe that the utility of this test does not end with
ensuring halting of compilation. Should the test fail, then we know that the
transformation is invalid, and a report on which productions were not conserved
could be a useful diagnostic. Also, the stack represents the different ways that
the parser is trying to match the sentence, and requiring valid transformations
means that, once a parser starts trying to match a production, it cannot go back
and reinterpret what it already saw.

'There is, however, the view that a program containing an "undeclared identifier" error is semantically
well-formed: we could think of the error message produced by compiling it as the semantic value of the
program [21].

Many questions remain to be answered. 

The central result of the present work is the correctness of the TLR parsing
algorithm (see Theorem 13). This correctness relies on the transformations
emitted by the A-machine all being in V; is there a larger set of
transformations which could fill the role played by V?

We allow a full Turing machine to form the basis of a A-machine. Theorem 3
of [7] states (in part) that "Each automatic BUMG (bottom-up modifiable
grammar) accepts a context-free language"; in that work, an automatic BUMG is the
counterpart of a TLR grammar whose A-machine is essentially a finite
automaton. We therefore ask: what class of languages is generated by TLR grammars
whose A-machines are (essentially) finite automata? What class of languages is
generated by TLR grammars whose A-machines have bounded tapes?

The most important question to answer is this: can a practical transformative 
programming language be constructed? 